How to Run an FMEA for High-Risk Maintenance in 60 Minutes

A practical guide for senior EHS managers who need FMEA to change maintenance controls before isolation, lifting, line breaking, or restart exposure begins.

Key takeaways

  1. Use FMEA on one high-risk maintenance scenario, not on a broad maintenance category.
  2. Name failure modes as observable ways the task can fail, then connect each one to a credible effect.
  3. Treat scoring as secondary to evidence, because a guessed risk priority number cannot prove control quality.
  4. Convert the strongest rows into release criteria, hold points, and field-verification requirements.
  5. Use Headline Podcast conversations to help senior leaders challenge high-risk work before exposure reaches the crew.

High-risk maintenance often fails at the planning table, before a mechanic touches a flange, guard, valve, scaffold, hoist, or energized system. This guide shows senior EHS managers how to run a 60-minute FMEA that turns credible failure modes into control decisions before work starts.

What you need before starting

FMEA, or Failure Mode and Effects Analysis, works best when the team studies a specific task rather than a whole department. For high-risk maintenance, the scope should name the asset, energy source, exposed crew, planned work method, abnormal condition, and restart boundary. A vague scope produces vague failure modes.
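
The scope elements listed above can be captured as a simple record so that a blank field is visible before the session starts. The following is a minimal sketch, not a prescribed template: the class name, the field names, and the pump example values are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class MaintenanceScope:
    """Hypothetical scope record; fields mirror the scope elements named in the text."""
    asset: str = ""
    energy_source: str = ""
    exposed_crew: str = ""
    work_method: str = ""
    abnormal_condition: str = ""
    restart_boundary: str = ""


def missing_scope_fields(scope: MaintenanceScope) -> list[str]:
    """Return the names of scope fields that were left blank."""
    return [f.name for f in fields(scope) if not getattr(scope, f.name).strip()]


# Illustrative values only; they do not come from a real work order.
scope = MaintenanceScope(
    asset="Pump seal on a process pump",
    energy_source="Process pressure, adjacent energized equipment",
    exposed_crew="Two mechanics and one contractor fitter",
    work_method="Seal replacement during a planned outage",
)

print(missing_scope_fields(scope))
```

In this example the abnormal condition and restart boundary are still blank, which is the kind of gap that tends to produce vague failure modes later in the session.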

The minimum team is operations, maintenance, EHS, the supervisor who owns the work, and one person who understands the equipment history. If contractors will perform the job, include the contractor supervisor before mobilization. The strongest input is not a template. It is the combination of work order, isolation plan, permit expectations, drawings, previous near misses, and field constraints.

On Headline Podcast, Andreza Araujo and Dr. Megan Tranter often bring leadership back to the same practical test: does the conversation change the next decision? This FMEA routine follows that test. It is not a documentation exercise. It should change what gets isolated, verified, delayed, redesigned, supervised, or escalated.

Step 1: Define the maintenance scenario tightly

Start by writing one maintenance scenario in plain language. A useful scenario sounds like this: replace a pump seal during a planned outage while adjacent equipment remains energized. That sentence gives the team a task, an asset, a timing condition, and an interface with live risk.

The common error is analyzing a category such as maintenance shutdown or line opening. Those labels are too broad because the failure mode changes with pressure, product, elevation, isolation quality, lifting path, access, weather, contractor interface, and restart sequence. A broad scenario can fill a wall with notes while leaving the job unchanged.

Use the existing JSA before high-risk work as a starting input, not as a substitute. The JSA asks what hazards are present and what controls are required. The FMEA asks how the planned method can fail, what effect that failure creates, and which control would prove the failure path is blocked.

Step 2: What can fail during the task?

Name failure modes as observable ways the task can go wrong. For high-risk maintenance, examples include wrong isolation point, trapped pressure, incorrect gasket material, dropped object, incompatible hose, scaffold access blocked, temporary bypass left in place, wrong torque sequence, damaged lifting accessory, or a restart step performed out of order.

Do not write failure modes as moral judgments. Carelessness, complacency, and lack of attention do not tell the team what to verify. They describe a person in general terms while hiding the field condition that can be controlled. James Reason's work on latent failures is useful here because many visible errors sit on top of design, planning, supervision, purchasing, or maintenance history.

A strong 60-minute session should produce 8 to 12 credible failure modes, not 40 weak ones. The team should stop adding items when the next item no longer changes a control, verification step, hold point, or owner.

Step 3: Connect each failure mode to a credible effect

For each failure mode, write the effect in injury, release, fire, explosion, exposure, or business-continuity language. Wrong isolation point may create unexpected energization. Trapped pressure may create chemical release. A damaged lifting accessory may create a dropped load. A bypass left in place may defeat a safeguard during restart.

The effect should be credible, not theatrical. The point is not to frighten the room into accepting every worst case. The point is to separate low-consequence nuisance failures from serious exposure that deserves stronger control proof. When every line receives the same severity language, leaders cannot see which decision matters most.

This is where many teams misuse the risk matrix in fatal-exposure decisions. A color score can help prioritize, but it cannot prove that isolation is correct, that a blind is installed, that a lift path is clear, or that a restart boundary is understood.

Step 4: List existing controls without pretending they all work

List the controls that already exist for each failure mode, including isolation, permit-to-work, lockout, bleed and drain, gas testing, line breaking method, lifting plan, scaffold tag, competent person review, pre-task briefing, and restart approval. Then ask whether each control is designed, available, understood, and verifiable at the point of work.

Andreza Araujo's co-host perspective is useful here because her book The Illusion of Compliance challenges the comfort leaders feel when documents appear complete. A permit can exist while the isolation is wrong. A checklist can be signed while the field condition has changed. A procedure can be current while the crew lacks the tool that makes the safe method realistic.

The practical test is simple. If the control cannot be seen, tested, confirmed, or owned before exposure starts, mark it as weak. Do not let the word control survive because it appears in the procedure.
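
One way to keep this test honest is to record the four questions asked of each control in this step (designed, available, understood, verifiable) and let the record decide whether the control counts. The sketch below is illustrative only. It assumes that a single "no" is enough to mark a control as weak, which is a stricter reading than the text requires, and the control names and answers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ControlCheck:
    """One existing control and the four questions asked of it at the point of work."""
    name: str
    designed: bool
    available: bool
    understood: bool
    verifiable: bool

    def is_weak(self) -> bool:
        # Assumption: any single "no" marks the control as weak.
        return not all(
            (self.designed, self.available, self.understood, self.verifiable)
        )


# Hypothetical answers gathered during the session.
controls = [
    ControlCheck("Lockout on pump motor", True, True, True, True),
    ControlCheck("Bleed and drain", True, True, True, False),
    ControlCheck("Restart approval", True, False, True, True),
]

weak = [c.name for c in controls if c.is_weak()]
print(weak)
```

A team could relax the rule, for example by weighting the questions, but the simple all-or-weak version keeps the discussion focused on what can be shown in the field.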

Step 5: Score severity, occurrence, and detectability with evidence

Traditional FMEA uses severity, occurrence, and detectability to calculate a risk priority number (RPN), the product of the three scores. In high-risk maintenance, the scoring matters less than the evidence behind the score. Ask what incident history, inspection result, maintenance backlog, equipment condition, contractor experience, or control-verification record supports each number.

A team that guesses all three scores has not analyzed risk. It has negotiated confidence. When evidence is missing, mark the evidence gap instead of smoothing the number until the spreadsheet looks acceptable. That gap may become the first action before the work is released.

For a 60-minute review, use only the three scoring columns and reserve debate for high-severity rows. The common trap is spending half the meeting arguing whether occurrence is 3 or 4 while nobody verifies the blind list, lift plan, gas test, or restart hold point.
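
The scoring step can be sketched in a few lines. The example below multiplies severity, occurrence, and detectability into a risk priority number, flags any row whose scores have no supporting evidence, and shortlists high-severity rows for debate. It assumes the conventional 1 to 10 scale and a severity cut-off of 8; both are assumptions, as are the class name, the scores, and the evidence notes, which are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

SEVERITY_DEBATE_THRESHOLD = 8  # assumed cut-off on an assumed 1-10 scale


@dataclass
class FmeaRow:
    failure_mode: str
    severity: int
    occurrence: int
    detectability: int
    evidence: Optional[str] = None  # what supports the three scores, if anything

    @property
    def rpn(self) -> int:
        # Risk priority number: product of the three scores.
        return self.severity * self.occurrence * self.detectability

    @property
    def evidence_gap(self) -> bool:
        # True when nothing has been recorded to support the scores.
        return not (self.evidence and self.evidence.strip())


# Hypothetical rows; scores and evidence notes are illustrative only.
rows = [
    FmeaRow("Wrong isolation point", 9, 3, 4, "Isolation near misses on this unit"),
    FmeaRow("Trapped pressure", 8, 4, 5),
    FmeaRow("Wrong torque sequence", 4, 3, 3, "Torque records from last outage"),
]

for row in rows:
    flag = "EVIDENCE GAP" if row.evidence_gap else "evidence recorded"
    print(f"{row.failure_mode}: RPN={row.rpn} ({flag})")

# Only high-severity rows earn debate time in a 60-minute session.
debate = [r.failure_mode for r in rows if r.severity >= SEVERITY_DEBATE_THRESHOLD]
print(debate)
```

Note that the trapped-pressure row has the highest number and no evidence behind it. In this sketch that gap is surfaced rather than averaged away, so it can become the first action before release.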

Step 6: Which controls need proof before work starts?

Convert the highest-exposure rows into proof requirements. Proof means the team can show that the control is present and effective before the crew enters the exposure path. For isolation, proof may be a field walkdown, lock verification, zero-energy test, bleed status, and independent confirmation. For lifting, proof may be sling inspection, load weight confirmation, exclusion zone, weather check, and competent lift supervision.

This step connects FMEA to critical-control verification. A control named in an FMEA is still only an intention until someone verifies it under the conditions of the job. The best output is therefore not a longer worksheet. It is a short list of controls that must be proven before work starts.

In more than 250 cultural transformation projects connected to Andreza Araujo's work, one repeated weakness is confidence without proof. Teams believe a control exists because the system says it should exist. FMEA should challenge that habit by asking what evidence would convince a skeptical supervisor standing at the equipment.

Step 7: Assign owners by authority, not by convenience

Assign each action to the person or function with authority to change the condition. EHS may facilitate the review, but EHS cannot own a missing spare, poor access platform, delayed isolation window, contractor staffing gap, or engineering decision unless it controls the resource.

The ownership rule should be blunt. If the action changes the equipment, engineering owns it. If it changes timing, operations owns it. If it changes tooling or method, maintenance owns it. If it changes contractor readiness, procurement or the contract owner must be present. If it changes verification discipline, the supervisor and EHS share the field check.
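
The ownership rule above is simple enough to write as a lookup, which makes it harder to hand an action to whoever happens to be in the room. This is a minimal sketch: the dictionary keys and the function name are hypothetical labels, and only the mapping itself comes from the rule stated in the text.

```python
# Mapping from "what the action changes" to the function with authority.
# Keys are hypothetical labels; the owners follow the rule in the text.
OWNER_BY_CHANGE = {
    "equipment": ["engineering"],
    "timing": ["operations"],
    "tooling_or_method": ["maintenance"],
    "contractor_readiness": ["procurement or contract owner"],
    "verification_discipline": ["supervisor", "EHS"],
}


def assign_owner(change_type: str) -> list:
    """Return the owner(s) for an action, or fail loudly if the type is unknown."""
    try:
        return OWNER_BY_CHANGE[change_type]
    except KeyError:
        raise ValueError(
            f"No owner rule for '{change_type}'; decide ownership in the session"
        ) from None


print(assign_owner("timing"))
print(assign_owner("verification_discipline"))
```

Raising an error for an unknown change type is a deliberate choice in this sketch: an action with no clear authority should be resolved in the meeting, not defaulted to EHS.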

This is also where the team should connect with the right change-control check. Some FMEA outputs are not maintenance actions at all. They are management of change (MOC) triggers, pre-startup safety review issues, or field-verification requirements that need a different workflow.

Step 8: Close the review with release criteria

End the FMEA by writing release criteria for the job. Release criteria state what must be true before work starts, what must stop the job, who can restart it, and what evidence will be kept after completion. Without release criteria, the FMEA becomes advice that competes with schedule pressure.

For high-risk maintenance, release criteria may include verified isolation, approved lift plan, competent supervision, rescue or emergency arrangements, contractor briefing, permit quality check, hold point completion, and restart authorization. The list should fit the scenario rather than repeat every possible safety requirement.
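
Release criteria work as a gate: the job is released only when every criterion chosen for the scenario has been verified. The sketch below shows that logic under stated assumptions. The function name and the true/false status values are hypothetical, and the criteria are a subset of the examples listed above.

```python
from typing import Dict, List, Tuple


def release_decision(criteria: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Return (released, unmet): released is True only when nothing is unmet."""
    unmet = [name for name, verified in criteria.items() if not verified]
    return (not unmet, unmet)


# Hypothetical status for one job; values are illustrative only.
criteria = {
    "Verified isolation": True,
    "Approved lift plan": True,
    "Contractor briefing": False,
    "Hold point completion": True,
    "Restart authorization": False,
}

released, unmet = release_decision(criteria)
print("Release job:", released)
print("Unmet criteria:", unmet)
```

Returning the unmet list alongside the decision gives the supervisor something concrete to close out, instead of a bare "not released" that has to compete with schedule pressure.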

Close the loop after the job. If the job exposed a failure mode that the FMEA did not identify, update the method. If the team created an action, track whether it changed the next job. If the same failure still occurs, move it into corrective action triage before the organization accepts another weak fix.

FMEA vs JSA vs risk matrix: what changes in the decision?

| Method | Best use | Weak use | Decision it should change |
| --- | --- | --- | --- |
| FMEA | Testing how a planned task, component, control, or sequence can fail | Scoring a worksheet without field evidence | Proof required before work starts |
| JSA or JHA | Breaking a job into steps, hazards, and controls | Repeating generic hazards that do not fit the task | Work method and task controls |
| Risk matrix | Prioritizing risk conversations | Treating color as proof that controls exist | Escalation and review priority |
| Field verification | Checking whether critical controls are present at the point of work | Auditing paperwork after exposure has passed | Release, pause, or restart |

Conclusion

An FMEA for high-risk maintenance is useful only when it changes what the organization verifies before exposure starts. If the session ends with scores but no proof requirement, the team has documented concern without improving control.

Headline Podcast keeps returning to the gap between leadership language and real work. Use this 60-minute routine to make that gap visible before maintenance begins, then keep the conversation going at headlinepodcast.us.

Frequently asked questions

What is FMEA in high-risk maintenance?
FMEA in high-risk maintenance is a structured review of how a planned task, component, control, or sequence can fail, what effect that failure can create, and what proof is needed before work starts.

How is FMEA different from a JSA?
A JSA breaks the job into steps, hazards, and controls. FMEA asks how the planned method or control can fail, then uses that failure path to define verification, release criteria, or escalation.

Can an FMEA be done in 60 minutes?
Yes, when the scope is tight. A 60-minute FMEA should focus on one high-risk maintenance scenario and the 8 to 12 failure modes most likely to change a control decision.

Who should attend a maintenance FMEA?
The review should include operations, maintenance, EHS, the work supervisor, one person who understands the equipment history, and the contractor supervisor when contractors will perform or support the job.

What is the best output from a maintenance FMEA?
The best output is not the score. It is a short set of controls that need proof, owners with authority, stop criteria, and release criteria before high-risk work begins.

About the author

Andreza Araújo

Safety Culture Expert | Senior EHS Executive

Andreza Araújo is a safety culture expert and senior EHS executive with more than 25 years of experience in environment, health and safety. She is a Civil Engineer and Occupational Safety Engineer from Unicamp, holds a Master's degree in Environmental Diplomacy from the University of Geneva, and completed sustainability studies at IMD Switzerland. Andreza has served in Global Head of EHS roles in Fortune 500 environments, leading cultural transformation programs across multinational operations. She has represented Brazil as a speaker at the United Nations in Paris and has spoken at the International Labour Organization in Turin. She is the author of more than 16 books on safety culture in Portuguese, Spanish, English and German. Her work has earned more than 10 EHS awards, including two recognitions from Indra Nooyi, former PepsiCo CEO.

  • Civil & Safety Engineer (Unicamp)
  • M.A. Environmental Diplomacy (University of Geneva)
  • Sustainability Cert (IMD Switzerland)
  • People Management & Coaching (Ohio University)
  • UN Paris speaker representative for Brazil
  • ILO Turin speaker
  • LinkedIn Top Voice
  • Indra Nooyi PepsiCo CEO recognition (2x)
