
Power Plants – Maintenance Practices and RCA of equipment failures


In casual discussions among Electrical Engineers I am often asked to opine on the significance of the various types of equipment maintenance in Power Plants, as well as on the investigation stages, after an equipment failure has occurred.

This article intends to provide insights on these matters, in a clear, methodical and systematic fashion.

To begin with, where did the term “Maintenance” come from, and how did it develop through time?

British Standard 3811:1993 Glossary of terms defines maintenance as: “The combination of all technical and administrative actions, including supervision actions, intended to retain an item in, or restore it to, a state in which it can perform a required function”.

Today, we have a more accurate definition for maintenance: “A set of organized activities that are carried out in order to keep an item in its best operational condition with minimum cost acquired. Activities of maintenance function could be either repair or replacement activities, which are necessary for an item to reach its acceptable productivity condition and these activities should be carried out with a minimum possible cost”.

Before World War II, maintenance was regarded as an added cost to the plant that did not increase the value of the finished product. Maintenance in that era was therefore restricted to fixing a unit when it broke, because this was the cheapest alternative.

During and after World War II, as engineering and scientific technology advanced, other types of maintenance were developed, such as preventive maintenance, which proved much cheaper. In addition, maintenance came to be classified as a function of the production system.

Source: DAC and HTME (Haward Technology Middle East)


Maintenance History (Adapted From Shenoy, Bhadury 1998)

Nowadays, increased awareness of issues such as the environment, safety, and the quality of products and services makes maintenance one of the most important functions contributing to the success of the industry.

World-class companies are in continuous need of a very well-organized maintenance program to compete worldwide.

Power Plants, in particular, are included in the critical infrastructures list of every nation. As such, they bear the heavy burden of producing adequate and stable electricity with, ideally, no downtime, no emergency shutdowns and no unplanned outages, whatever it takes. In the real world this is very hard to achieve, and that is where maintenance plays a highly significant role in maximizing efficiency, availability and reliability.

Power Plants comprise not only key equipment, such as turbines, generators, transformers and switchgear, but also other main and auxiliary equipment skids (BOP – Balance of Plant) that are critical for the uninterrupted operation of the Plant. For example, instrument air compressors/dryers, lubrication oil purification systems, fuel forwarding pumps, HRSGs, Emissions Monitoring Systems, Water Treatment Plants, Waste Water Treatment Plants, HVAC systems, Fire Detection & Fire Fighting systems, SCADA, etc., are some of the pieces of equipment that ensure the reliable and consistent operation of a Power Plant.

To keep a Power Plant going, every piece of the above-referenced equipment must be included in the Plant’s maintenance schedule, as per the respective manufacturer’s O&M instructions, in order to retain its nominal performance and calculated life expectancy, maximize its availability and reduce its downtime.

We have to be absolute and determined when setting a Power Plant’s maintenance objectives. Maintenance objectives should be consistent with and subordinate to production goals. The relation between maintenance objectives and production goals is reflected in the action of keeping production machines and facilities in the best possible condition.

An effective set of maintenance objectives is listed below:
1. Maximizing production or increasing facilities availability at the lowest cost and at the highest quality and safety standards.
2. Reducing breakdowns and emergency shutdowns.
3. Optimizing resources utilization.
4. Reducing downtime.
5. Improving spares stock control.
6. Improving equipment efficiency and reducing scrap rate.
7. Minimizing energy usage.
8. Optimizing the useful life of equipment.
9. Providing reliable cost and budgetary control.
10. Identifying and implementing cost reductions.

This set of maintenance objectives can be visually displayed and summarized as:

Source: DAC and HTME (Haward Technology Middle East)

Maintenance Objectives

There are five (5) types of maintenance:

1. Run to Failure Maintenance (RTF)
2. Preventive Maintenance (PM)
3. Corrective Maintenance (CM)
4. Improvement Maintenance (IM)
5. Predictive Maintenance (PDM)




1. Run to Failure Maintenance (RTF)

Run to failure maintenance is the repair, replacement, or restoration action performed on a machine or facility after a failure has occurred, in order to bring it back to at least its minimum acceptable condition. It is the oldest type of maintenance.
It is subdivided into two types:

Emergency maintenance: it is carried out as fast as possible in order to bring a failed machine or facility to a safe and operationally efficient condition.

Breakdown maintenance: it is performed after the occurrence of an anticipated failure, for which advance provision has been made in the form of repair method, spares, materials, labor and equipment.

The disadvantages of this type of maintenance are:
a. Its activities are expensive in terms of both direct and indirect cost.
b. The failure of one component can cause failures in other components of the same equipment, which leads to low production availability.
c. Its activities are very difficult to plan and schedule in advance.

This type of maintenance is useful in the following situations:
• The failure of a component in a system is unpredictable.
• The cost of performing run to failure maintenance activities is lower than the cost of the activities of other types of maintenance.
• The equipment failure priority is too low to justify including preventive activities in the planned maintenance budget.
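
The cost criterion above can be illustrated with a rough expected-cost comparison. The sketch below is a simplification under stated assumptions: the function names, cost figures and failure rates are all hypothetical, and a real assessment would also weigh consequential damage, lost revenue and safety.

```python
def annual_rtf_cost(failures_per_year, cost_per_failure):
    """Expected yearly cost when the item is simply run until it fails."""
    return failures_per_year * cost_per_failure


def annual_pm_cost(pm_tasks_per_year, cost_per_pm_task,
                   residual_failures_per_year, cost_per_failure):
    """Expected yearly cost of scheduled tasks plus the failures that still occur."""
    planned = pm_tasks_per_year * cost_per_pm_task
    unplanned = residual_failures_per_year * cost_per_failure
    return planned + unplanned


def rtf_is_cheaper(rtf_cost, pm_cost):
    """True when running to failure costs less than the preventive alternative."""
    return rtf_cost < pm_cost


# Hypothetical figures for a low-criticality item (illustrative only).
rtf = annual_rtf_cost(failures_per_year=0.5, cost_per_failure=800.0)
pm = annual_pm_cost(pm_tasks_per_year=4, cost_per_pm_task=150.0,
                    residual_failures_per_year=0.1, cost_per_failure=800.0)

print(f"RTF: {rtf:.0f} per year, PM: {pm:.0f} per year")
if rtf_is_cheaper(rtf, pm):
    print("Run to failure is the cheaper option for this item")
else:
    print("Preventive maintenance is the cheaper option for this item")
```

For key equipment the same comparison would normally tip the other way, because the cost per failure includes secondary damage and lost generation.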



2. Preventive Maintenance (PM)

Preventive maintenance is a set of activities performed on plant equipment, machinery, and systems before the occurrence of a failure, in order to protect them and to prevent or eliminate any degradation in their operating conditions.

British Standard 3811:1993 Glossary of terms defines preventive maintenance as: the maintenance carried out at predetermined intervals or according to prescribed criteria and intended to reduce the probability of failure or the degradation of the functioning of an item.

The advantage of applying preventive maintenance activities is that they satisfy most of the maintenance objectives.

The factors that affect the efficiency of this type of maintenance:

a) The need for an adequate number of staff in the maintenance department in order to perform this type of maintenance.
b) The right choice of production equipment and machinery that is suitable for the working environment and that can tolerate the workload of this environment.
c) The required staff qualifications and skills, which can be gained through training.
d) The support and commitment from executive management to the PM program.
e) The proper planning and scheduling of the PM program.
f) The ability to properly apply the PM program.

Preventive maintenance suits machines and facilities whose failure would cause serious production losses. Its aim is to maintain machines and facilities in such a condition that breakdowns and emergency repairs are minimized.
Its activities include replacements, adjustments, major overhauls, inspections and lubrication.


Image source: DAC Consulting


3. Corrective Maintenance (CM)

In this type of maintenance, actions such as repair, replacement, or restoration are carried out after the occurrence of a failure, in order to eliminate the source of the failure or reduce the frequency of its occurrence.

In the British Standard 3811:1993 Glossary of terms, corrective maintenance is defined as: the maintenance carried out after fault recognition and intended to put an item into a state in which it can perform a required function.

This type of maintenance is subdivided into three types:

a) Remedial maintenance, which is a set of activities that are performed to eliminate the source of failure without interrupting the continuity of the production process. This type of corrective maintenance is carried out by taking the item to be corrected out of the production line and either replacing it with a reconditioned item or transferring its workload to a redundant unit.
b) Deferred maintenance, which is a set of corrective maintenance activities that are not immediately initiated after the occurrence of a failure but are delayed in such a way that will not affect the production process.
c) Shutdown corrective maintenance, which is a set of corrective maintenance activities that are performed when the production line is completely stopped.

The main objectives of corrective maintenance are the maximization of the effectiveness of all critical plant systems, the elimination of breakdowns, the elimination of unnecessary repair, and the reduction of the deviations from optimum operating conditions.
The difference between corrective maintenance and preventive maintenance is that, for corrective maintenance, a failure must occur before any corrective action is taken.
Corrective maintenance differs from run to failure maintenance in that its activities are planned and regularly carried out to keep the plant’s machines and equipment in optimum operating condition.


4. Improvement Maintenance (IM)

Improvement maintenance aims at reducing or eliminating entirely the need for maintenance. It is subdivided into three types as follows:

a) Design-out maintenance, which is a set of activities used to eliminate the cause of maintenance, simplify maintenance tasks, or raise machine performance from the maintenance point of view. It does so by redesigning those machines and facilities that are prone to frequent failure and whose long-term repair or replacement cost is very high.
b) Engineering services, which include construction and construction modification, removal and installation, and rearrangement of facilities.
c) Shutdown improvement maintenance, which is a set of improvement maintenance activities that are performed while the production line is completely stopped.


5. Predictive Maintenance (PDM)

Predictive maintenance is a set of activities that detect changes in the physical condition of equipment (signs of failure) in order to carry out the appropriate maintenance work for maximizing the service life of equipment without increasing the risk of failure.

It is classified into two kinds according to the methods of detecting the signs of failure:

• Condition-based predictive maintenance
• Statistical-based predictive maintenance

a) Condition-based predictive maintenance depends on continuous or periodic condition monitoring equipment to detect the signs of failure.
b) Statistical-based predictive maintenance depends on statistical data from the meticulous recording of the stoppages of the in-plant items and components in order to develop models for predicting failures.
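
The two kinds above can be contrasted with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the vibration readings, the alarm limit and the stoppage records are hypothetical, and real condition monitoring systems use far richer diagnostics than a single limit check.

```python
from statistics import mean


def condition_alarm(readings, limit):
    """Condition-based: flag when any monitored reading exceeds its alarm limit."""
    return any(value > limit for value in readings)


def mean_time_between_failures(failure_hours):
    """Statistical-based: average running hours between recorded stoppages.

    failure_hours holds the cumulative running hours at which each stoppage occurred.
    """
    if len(failure_hours) < 2:
        raise ValueError("at least two recorded stoppages are needed")
    ordered = sorted(failure_hours)
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return mean(intervals)


# Hypothetical bearing vibration readings (mm/s) and a hypothetical alarm limit.
vibration = [2.1, 2.3, 2.2, 4.9]
alarm = condition_alarm(vibration, limit=4.5)
print("Vibration alarm raised:", alarm)

# Hypothetical cumulative running hours at which a pump stopped.
stoppages = [1200, 2600, 3900, 5400]
mtbf = mean_time_between_failures(stoppages)
print("Mean running hours between stoppages:", mtbf)
```

The first function reacts to the present state of the machine, while the second relies entirely on how carefully past stoppages were recorded.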

The drawback of predictive maintenance is that it depends heavily on information and the correct interpretation of the information. Some researchers classified predictive maintenance as a type of preventive maintenance.

The main difference between preventive maintenance and predictive maintenance is that predictive maintenance monitors the condition of machines or equipment to determine the actual mean time to failure, whereas preventive maintenance depends on industrial average life statistics.

In short, the types of maintenance for Power Plant equipment can be summarized in the diagram below:


Source: DAC and HTME (Haward Technology Middle East)


For all of the above to work for a given installation (power plant), top management should be decisive on the maintenance strategy to be followed for their power plant’s assets, depending on the criticality of main and auxiliary equipment or skids (BOP), and the criticality of the plant itself, as well as of the loads it serves.

Because key equipment technical specifications in almost all power plant construction contracts include a predetermined minimum expected service life (e.g. 30 or 40 years), this figure in fact serves as a “guide value” for shaping the maintenance strategy. Key equipment vendors are obliged to provide a life expectancy study for their machines, which takes into consideration various factors, such as operating conditions, ambient conditions, vibrations, corrosive atmosphere, etc., and assumes that the manufacturer’s O&M instructions are always followed to the letter.

Modern power plants are usually supported by a suitable online condition monitoring and performance analysis system (frequently referred to as MDPAS), which provides maintenance engineers with the tools they need to decide on next actions with regard to the aforementioned Predictive Maintenance (PDM) activities.

PDM’s major economic advantage is that a latent issue, or an evolving defect, is detected at its earliest stage. Repair or replacement work therefore takes place when it is easier and cheaper to perform, before other parts of the equipment are affected, which would lead to more downtime, lack of spare parts, loss of revenue and maybe even Business Interruption.

Adhering to this maintenance strategy (PDM) allows the equipment to retain its good operating condition, performance and initial service life expectancy, which means a lot to the Business Owner in economic terms.

As I’ve mentioned earlier, PDM requires a lot of available information and its correct interpretation, to perform as expected. Thankfully, computing technology, data storage facilities, high speed networking and suitable communication protocols allow for maintaining a complete Database of the Power Plant’s operational data. MDPAS-like systems, interconnected with this Database and relying at the same time on statistical maintenance data provided by equipment vendors, allow for reliable decision making by comparing online operational data to past data combinations that would indicate a potential problem or latent defect.
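
One simple way to picture the comparison of online data against past data is a deviation score relative to a historical baseline. The sketch below is only an assumption about how such a check could look, not a description of how any MDPAS product works; the function names, the temperature values and the threshold of three standard deviations are hypothetical.

```python
from statistics import mean, stdev


def deviation_score(history, current):
    """Number of sample standard deviations between current and the historical mean."""
    spread = stdev(history)
    if spread == 0:
        raise ValueError("history has no variation; a score cannot be computed")
    return (current - mean(history)) / spread


def needs_attention(history, current, max_score=3.0):
    """True when the current value departs from the baseline by more than max_score."""
    return abs(deviation_score(history, current)) > max_score


# Hypothetical lube oil temperatures (degrees Celsius) recorded under comparable load.
past_temperatures = [61.0, 62.0, 60.0, 61.0, 62.0, 60.0]

normal_reading = needs_attention(past_temperatures, 61.5)
abnormal_reading = needs_attention(past_temperatures, 68.0)
print("61.5 flagged:", normal_reading)
print("68.0 flagged:", abnormal_reading)
```

A check of this kind only makes sense when the historical values were recorded under operating conditions comparable to the current ones, which is why the quality of the Database matters so much.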

The most advanced power plant installations may also incorporate Artificial Intelligence (AI) to “learn” from the decision making processes, provided they are fed with the necessary feedback by the responsible engineer, as far as the actions taken, the outcome, spares/consumables used, personnel required, etc., are concerned. AI utilization in maintenance strategies speeds up all relevant procedures (early detection, decision making, actions) and allows for automating most of the maintenance administration tasks.

Last, but not least, let’s not discredit the value that Asset Management may add to the Power Plant’s activities, including of course, the maintenance activities. With a variety of Asset Management software suites available in the market today, fully customizable to any industry, including power plants, investing in such a system and including it in the core of maintenance strategy, is expected to pay back soon and help maintenance engineers execute their duties in a far more confident fashion.

“We have all of the above in place, we follow closely our maintenance schedule, but we suffered a significant equipment failure!! What did we do wrong and what are the next steps to take?”

No matter how hard you try, how strictly you follow your power plant’s maintenance schedules, how well your MDPAS tools are performing … there will always be factors beyond your control that may, under specific circumstances, lead to a minor, or major equipment failure.

These can either be:

a. External electrical stresses/abnormalities
b. Design and/or manufacturing defects
c. Human error (both in Operation and in Maintenance)
d. Unusual service conditions
e. Transients
f. Unrevealed equipment vulnerabilities
g. Malicious acts
or even any combination of them.

It is expected that power plant equipment failures, especially those of key equipment, will impose a huge amount of anxiety and frustration on the personnel responsible for their O&M, especially if the failure leads to Business Interruption, loss of revenue or, in the worst case scenario, power outage in the vicinity of the power plant.

The most reliable method to deal with such events is by applying a Root Cause Analysis (RCA), in order to determine the events, the conditions and the mechanisms that led to the failure. A definitive threshold for performing RCA for an event shall be established, as part of the overall operating plant strategy.

Root Cause Analysis is performed in six (6) main steps:

a. Definition of the failure event.
b. Forensic Investigation/Analysis to reveal the various causes of the failure.
c. Determination of the most probable (Root) Cause that triggered the sequence of events, which eventually led to the failure.
d. Definition of the most suitable solutions, which can lead to mitigation, or at best, elimination of the Root Cause.
e. Selection and application of the most affordable solution defined in step d.
f. Verification of selected and applied solution effectiveness.

Performing the forensic investigation/analysis as part of the RCA involves failure troubleshooting, which shall be performed independently of the Event Investigation team. Troubleshooting is a logical, systematic, experience-based process to identify the failures or malfunctions, or their symptoms, within a technical system, and to determine and eliminate their causes.

If at any time during the performance of the root cause investigation additional adverse conditions are identified, then individual condition reports should be initiated.

Careful preservation of the evidence is very important in the determination of the actual cause of an issue or event. In cases where malicious intent is suspected, it may be necessary to get Security involved.

The steps to performing the actual investigation are:

a. Gathering of information. Thorough information gathering is important to the success of the investigation. Missed evidence can lead to improper or inadequate conclusions.
b. Creating an event timeline for discussion of facts. When investigating an event, it is typical to use an event timeline to help identify when the first causal factor had an impact on the event. The event timeline should begin at the first set of facts just prior to the first failure or inappropriate action.
c. Selecting the proper root cause techniques. Usually a combination of techniques, selected as the investigation progresses, will be necessary to ensure the effective analysis of the event. Regardless of the technique utilized, a basic event and causal factor chart should be used to help identify all failures or inappropriate actions.
d. Error precursors and failed defenses. When performing a root cause investigation that has an element of human error as part of the cause, it is very important to make the assumption that all individuals come to work to do the best job they can every day. During the course of the investigation it can be determined if malicious intent was a factor in the event. When human errors occur, root cause investigators must always put the errors in the proper perspective.
e. Repeat occurrence. Repeat events can only be determined once the root cause has been identified. For organizational or programmatic issues, repeat events are those in which the root cause and the issue or event being investigated are similar to a significant event or issue from within the last few years.
f. Cause identification. A Root Cause is the most fundamental reason for an event or adverse condition, which, if corrected, will effectively prevent or minimize recurrence of the event or condition. A Contributing Cause is a causal factor that exacerbated the problem but is not the root cause of the problem. A Causal Factor is any action or condition that either causes an event to occur or increases its severity.
g. Extent of cause / Extent of condition. The Extent of Cause is the extent to which the root causes of an identified problem have impacted other plant processes, equipment or human performance. The Extent of Condition is the extent to which the actual condition exists with other plant processes, equipment or human performance.
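
The event timeline described in step b can be kept as a simple ordered record. The sketch below is illustrative only: the `Event` class, its fields and the sample entries are hypothetical, and the causal factor flag is something the investigator sets, not something the code can infer.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    time: datetime
    description: str
    causal_factor: bool = False  # set by the investigator, not inferred by the code


def build_timeline(events):
    """Return the events in chronological order."""
    return sorted(events, key=lambda event: event.time)


def first_causal_factor(timeline):
    """Return the earliest event marked as a causal factor, or None if there is none."""
    for event in timeline:
        if event.causal_factor:
            return event
    return None


# Hypothetical records, entered in the order they were collected.
records = [
    Event(datetime(2024, 3, 1, 14, 20), "Generator tripped on high bearing temperature"),
    Event(datetime(2024, 3, 1, 13, 5), "Lube oil cooler valve left partly closed", True),
    Event(datetime(2024, 3, 1, 13, 50), "Bearing temperature alarm acknowledged", True),
    Event(datetime(2024, 3, 1, 12, 30), "Routine maintenance on lube oil skid completed"),
]

timeline = build_timeline(records)
for event in timeline:
    marker = "*" if event.causal_factor else " "
    print(f"{marker} {event.time:%H:%M} {event.description}")

earliest = first_causal_factor(timeline)
print("First causal factor:", earliest.description)
```

Starting the list just before the first failure or inappropriate action, as step b recommends, keeps the timeline short enough to discuss fact by fact.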

Next in the RCA procedural steps come the Corrective Actions and the Effectiveness Review Assignment.

Review and Approval of the RCA investigation comes from the Event Review Board and the responsible Manager.

A brief list of the tools and techniques for performing an RCA of a failure event, deriving mainly from Quality Management Systems applications, follows:

1. Interviewing
2. Task Analysis
3. Change Analysis
4. Barrier Analysis
5. Event and causal factor charting
6. Cause and effect analysis
7. Fault tree analysis
8. Event tree analysis
9. 5 whys (why staircase)
10. Common cause analysis
11. Current Reality Tree
12. Failure Mode and Effects Analysis (FMEA)
13. Human factor investigation tool
14. Psychological and physiological evaluation
15. Ergonomics analysis
16. Kepner – Tregoe Analysis
17. Interrelationship diagram (ID)
18. JNES Organizational Factors List (JOFL)
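
Two of the techniques listed above, fault tree analysis and FMEA, rest on simple arithmetic that a short sketch can show. It assumes independent basic events for the fault tree gates and the common 1 to 10 rating scales for the FMEA risk priority number; the function names, probabilities, failure modes and ratings are all hypothetical.

```python
from math import prod


def or_gate(probabilities):
    """Output occurs if at least one of the independent input events occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def and_gate(probabilities):
    """Output occurs only if all of the independent input events occur."""
    return prod(probabilities)


def risk_priority_number(severity, occurrence, detection):
    """FMEA ranking figure: product of three ratings, each on a 1 to 10 scale."""
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError("ratings must be between 1 and 10")
    return severity * occurrence * detection


# Hypothetical fault tree: loss of lube oil pressure occurs if both pumps fail,
# or if a single pipe ruptures.
both_pumps_fail = and_gate([0.02, 0.05])
top_event = or_gate([both_pumps_fail, 0.001])
print(f"Top event probability: {top_event:.6f}")

# Hypothetical FMEA rows: (failure mode, severity, occurrence, detection).
modes = [
    ("Bearing overheating", 8, 3, 4),
    ("Seal leakage", 5, 6, 2),
    ("Coupling misalignment", 6, 4, 7),
]
ranked = sorted(modes, key=lambda m: risk_priority_number(*m[1:]), reverse=True)
for name, severity, occurrence, detection in ranked:
    print(name, risk_priority_number(severity, occurrence, detection))
```

The fault tree works from the failure back to its possible causes, while the FMEA ranking works forward from each failure mode to its consequences, which is why the two are often used together.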

As a best practice, it is preferable to employ a reputable 3rd Party Technical Consultant to perform a proper, impartial, reliable and accurate Root Cause Analysis for a Power Plant equipment failure.

Experienced Specialists will effectively determine the most likely Root Cause of your equipment failure, and provide you with very useful feedback and recommendations for the most suitable, and at the same time affordable, corrective actions.

The DAC Consulting Services roster of specialists comprises Technical Consultants of all industrial disciplines, with significant experience in performing RCAs of failure events and quantum and delay analyses, who are capable of supporting and resolving complex and unclear technical cases.

Stathis Maliakis BSc., BEE, MSc., QM
Electrical Engineer / Expert Witness
Rhodes / Dubai / London
DAC Consulting Services Ltd.

