Tags: predictive-maintenance, anomaly-detection, AI, condition-monitoring

Why Your Threshold Alerts Miss 60% of Failures — And What to Do About It

Prevly Team

The 3 AM Phone Call

It's 3:14 AM when your phone rings. Pump 7A on the cooling loop just seized. Production line 3 is down. The maintenance team scrambles. Someone pulls up the monitoring dashboard — and every single alarm was green right up until catastrophic failure.

How is that possible? You spent months setting up condition monitoring. You have vibration sensors, temperature probes, current transformers. You set thresholds based on ISO 10816. Everything looked fine.

Except it wasn't fine. The pump had been slowly dying for three weeks. Your thresholds just couldn't see it.

The Threshold Problem: Static Rules in a Dynamic World

Here's the typical setup: you install a vibration sensor on a pump bearing, and you set an alarm at 4.5 mm/s RMS based on the ISO standard or the OEM manual. If vibration crosses that line, you get an alert.

This works — sometimes. It catches sudden failures where a bearing goes from normal to catastrophic in hours. But most industrial failures don't work that way.

Consider this real pattern: a pump bearing starts at a baseline vibration of 1.8 mm/s. Over six weeks, it creeps up to 2.1, then 2.4, then 2.9. At the same time, bearing temperature rises by 3 degrees Celsius, and motor current becomes slightly more erratic during startup. Each signal is well within its individual threshold. No alarm fires. Then one Tuesday, something shifts — vibration jumps to 6.2 mm/s and the bearing fails within hours.

The failure was predictable. The thresholds missed it because they were looking at the wrong thing: absolute values instead of patterns.
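To make the difference between absolute values and patterns concrete, here is a minimal sketch that runs the readings from the example above through both a fixed limit and a simple trend check. The two-week spacing between readings and the `max_slope` value are assumptions chosen for illustration, not figures from a standard, and a least-squares slope is only the simplest possible stand-in for a learned model.

```python
import numpy as np

ISO_ALARM_MM_S = 4.5  # static alarm level used in the example above

# RMS vibration readings (mm/s) from the example; the two-week spacing
# over the six-week period is an assumption made for illustration.
weeks = np.array([0.0, 2.0, 4.0, 6.0])
rms_mm_s = np.array([1.8, 2.1, 2.4, 2.9])


def threshold_alarm(values, limit=ISO_ALARM_MM_S):
    """Static rule: fire only if any reading exceeds the fixed limit."""
    return bool(np.any(values > limit))


def trend_alarm(t, values, max_slope=0.1):
    """Trend rule: fire if the least-squares slope exceeds max_slope.

    max_slope (mm/s per week) is an illustrative value, not a standard.
    """
    slope, _intercept = np.polyfit(t, values, deg=1)
    return bool(slope > max_slope), float(slope)


static_fired = threshold_alarm(rms_mm_s)
trend_fired, slope = trend_alarm(weeks, rms_mm_s)

print(f"static alarm fired: {static_fired}")
print(f"trend alarm fired:  {trend_fired} (slope {slope:.2f} mm/s per week)")
```

The static rule stays silent because no reading comes near 4.5 mm/s, while the slope check sees a steady climb. A real deployment would need to account for noise and changing operating conditions before trusting a slope like this.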

Second Scenario: The VFD-Driven Fan

Here's another pattern that thresholds consistently miss. A variable frequency drive (VFD) controls an air handling fan, varying speed between 600 and 1,800 RPM based on process demand. The fan's bearing is developing an outer race defect.

The challenge: vibration amplitude rises steeply with speed, here roughly with the square of speed. At 1,800 RPM, the bearing shows 3.2 mm/s, within the ISO 10816 "satisfactory" zone. At 600 RPM, the same bearing shows 0.4 mm/s, barely above the noise floor. A fixed threshold set for full speed will never trigger at partial speed, and a threshold set for partial speed will false-alarm constantly at full speed.

Meanwhile, the defect is progressing. The bearing's vibration at a given speed is creeping up by 0.1 mm/s per week. But because the operating speed changes every few minutes, the trend is invisible in the raw time series. Only a model that normalizes vibration by operating speed — learning that "2.1 mm/s at 900 RPM is abnormal even though 3.0 mm/s at 1,800 RPM is fine" — can catch this.
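The normalization idea can be sketched in a few lines. The example below assumes the simple speed-squared relationship described above and fits a single coefficient `k` to healthy reference points; the reference data, the function names, and the `ALERT_RATIO` cutoff are all hypothetical. A production model would learn a richer baseline, but the principle of comparing each reading to what is expected at that speed is the same.

```python
import numpy as np


def fit_speed_baseline(rpm, vib):
    """Least-squares fit of vib ≈ k * rpm**2 on healthy data; returns k."""
    x = np.asarray(rpm, dtype=float) ** 2
    y = np.asarray(vib, dtype=float)
    return float(np.dot(x, y) / np.dot(x, x))


def speed_normalized_ratio(rpm, vib, k):
    """Observed vibration divided by the level expected at that speed."""
    expected = k * np.asarray(rpm, dtype=float) ** 2
    return np.asarray(vib, dtype=float) / expected


# Hypothetical healthy reference points, loosely based on the figures above.
healthy_rpm = np.array([600, 900, 1200, 1500, 1800])
healthy_vib = np.array([0.36, 0.80, 1.42, 2.22, 3.20])
k = fit_speed_baseline(healthy_rpm, healthy_vib)

# The two readings from the text: 2.1 mm/s at 900 RPM, 3.0 mm/s at 1,800 RPM.
ratios = speed_normalized_ratio([900, 1800], [2.1, 3.0], k)

ALERT_RATIO = 1.5  # illustrative cutoff: 50% above the speed-adjusted baseline
flags = ratios > ALERT_RATIO

for rpm, ratio, flag in zip([900, 1800], ratios, flags):
    print(f"{rpm} RPM: {ratio:.2f}x expected, alert={bool(flag)}")
```

Under these assumptions the 900 RPM reading is well over twice its expected level and is flagged, while the numerically larger 1,800 RPM reading sits slightly below its baseline and is not.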

VFD-driven equipment represents a growing share of industrial assets (an estimated 30-40% of motors in modern facilities), and every one of them has this same speed-dependent baseline problem that static thresholds cannot address.

Detection Methods Compared

Not all detection approaches are equal across failure types. Here's how common methods perform:

| Detection Method | Sudden Failure | Gradual Bearing Wear | Speed-Dependent Degradation | Multi-Sensor Pattern | Typical Lead Time |
|---|---|---|---|---|---|
| Static threshold | 70% detection | 20-30% detection | < 10% detection | Not applicable | Hours |
| Envelope analysis | 50% detection | 60-70% detection | 40% detection | Not applicable | Days to weeks |
| LSTM Autoencoder | 85% detection | 85-90% detection | 80-85% detection | 90%+ detection | 2-4 weeks |
| TranAD (Transformer) | 90% detection | 90-95% detection | 90% detection | 95%+ detection | 2-6 weeks |

The key insight: threshold-based methods have a ceiling imposed by their single-sensor, fixed-baseline architecture. ML methods improve with data volume and capture the cross-sensor, speed-dependent patterns that represent the majority of real-world failures.

Why Machines Don't Fail According to Rules

Industrial equipment is messy. Three things make static thresholds fundamentally inadequate for catching most failures:

Gradual wear is invisible to fixed limits. A bearing doesn't go from "fine" to "broken." It degrades over weeks or months. The early signatures are tiny — a 0.3 mm/s increase in vibration, a half-degree temperature shift, a subtle change in spectral harmonics. Each one individually is noise. Together, they're a clear signal.

Failure patterns span multiple sensors. A pump doesn't fail in one dimension. An outer race defect shows up in vibration, temperature, current draw, and sometimes flow rate — simultaneously, but subtly. No single threshold captures a multi-dimensional pattern. You'd need to write hundreds of cross-sensor rules, and you'd still miss the ones you didn't think of.

Normal changes with conditions. A motor running at 1,800 RPM on a cool Monday morning has a different "normal" than the same motor at 3,600 RPM on a hot Friday afternoon under full load. Seasonal temperature swings, product changeovers, and load variations all shift the baseline. A fixed threshold either triggers false alarms constantly or is set so high that it misses real problems.

Industry data backs this up: studies from McKinsey and various reliability engineering surveys suggest that traditional threshold-based monitoring catches fewer than half of preventable failures, with many estimates putting it around 40%. That leaves roughly 60% that either show up as surprises or are caught by a human who happened to notice something felt off.
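The multi-sensor point is easy to demonstrate on synthetic data. The sketch below generates hypothetical healthy readings for three sensors that all move with load, then scores one reading in which every sensor is individually unremarkable but the combination is inconsistent. It uses the Mahalanobis distance, a classical statistical measure, as a simple illustration of a joint check; the sensor values, the noise levels, and the generating model are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical healthy data: three sensors that all follow the load.
load = rng.normal(0.0, 1.0, n)
healthy = np.column_stack([
    2.0 + 0.3 * load + rng.normal(0, 0.05, n),   # vibration, mm/s
    60.0 + 4.0 * load + rng.normal(0, 0.5, n),   # bearing temperature, deg C
    40.0 + 5.0 * load + rng.normal(0, 0.5, n),   # motor current, A
])

mean = healthy.mean(axis=0)
cov = np.cov(healthy, rowvar=False)
cov_inv = np.linalg.inv(cov)


def per_sensor_z(x):
    """Z-score of each sensor on its own, ignoring the others."""
    return (x - mean) / np.sqrt(np.diag(cov))


def mahalanobis(x):
    """Distance from the healthy cloud, accounting for sensor correlations."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))


# Vibration is up while temperature is down and current is average:
# each value is ordinary alone, but the combination never occurs when healthy.
reading = np.array([2.4, 56.0, 40.0])
z = per_sensor_z(reading)
dist = mahalanobis(reading)

print("per-sensor z-scores:", np.round(z, 2))
print(f"joint distance: {dist:.1f}")
```

Each z-score stays inside plus or minus two standard deviations, so no per-sensor rule would fire, yet the joint distance is several times what healthy readings produce. Writing that as explicit cross-sensor rules would require knowing the relationship in advance.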

What AI-Based Detection Actually Catches

The alternative isn't more rules. It's a system that learns what "normal" looks like for each machine, under its specific operating conditions, and flags when reality starts diverging from that learned baseline.

This is what LSTM autoencoders do. An LSTM (Long Short-Term Memory) network is a type of AI model that's particularly good at learning patterns in time-series data — the kind of data your sensors produce. An autoencoder is trained to reconstruct normal behavior. When the machine is healthy, the model's reconstruction closely matches reality. When something starts going wrong, the reconstruction error spikes — even if every individual sensor is still within its threshold.
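The reconstruction-error idea can be shown without a deep-learning framework. The sketch below swaps the LSTM autoencoder for a PCA projection, which acts as a linear autoencoder: it compresses each window of signal to a few components and reconstructs it. This is a deliberately simplified stand-in, not Prevly's model, and the synthetic signal, window length, and 99th-percentile threshold are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(42)
WINDOW = 32        # samples per window (illustrative)
N_COMPONENTS = 2   # size of the compressed representation

# Synthetic healthy signal: one periodic machine signature plus sensor noise.
t = np.arange(4000)
healthy = np.sin(2 * np.pi * t / 16) + rng.normal(0, 0.05, t.size)
train_windows = sliding_window_view(healthy, WINDOW)

# "Training": find the subspace that best reconstructs healthy windows.
mean = train_windows.mean(axis=0)
_, _, vt = np.linalg.svd(train_windows - mean, full_matrices=False)
components = vt[:N_COMPONENTS]


def reconstruction_error(window):
    """Mean squared error between a window and its low-rank reconstruction."""
    centered = window - mean
    reconstructed = centered @ components.T @ components
    return float(np.mean((centered - reconstructed) ** 2))


train_errors = np.array([reconstruction_error(w) for w in train_windows])
threshold = float(np.percentile(train_errors, 99))

# Faulty window: lower overall amplitude, but with a new harmonic that the
# healthy signal never contained. No sample exceeds the healthy peak level.
tw = np.arange(WINDOW)
fault_window = (0.7 * np.sin(2 * np.pi * tw / 16)
                + 0.2 * np.sin(2 * np.pi * tw / 4)
                + rng.normal(0, 0.05, WINDOW))
fault_error = reconstruction_error(fault_window)

print(f"threshold: {threshold:.4f}, fault error: {fault_error:.4f}")
print(f"peak amplitude, healthy vs fault: "
      f"{np.abs(healthy).max():.2f} vs {np.abs(fault_window).max():.2f}")
```

The faulty window never exceeds the healthy peak amplitude, so an amplitude threshold sees nothing, but the model cannot reconstruct the unfamiliar harmonic and the error rises well above the healthy range. An LSTM autoencoder applies the same logic with a nonlinear model across many sensors at once.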

Think of it like this: you know the sound of your car engine. You can't write down the exact frequency spectrum that defines "normal," but you instantly notice when something sounds off. An LSTM autoencoder does the same thing with sensor data — but across 10 or 20 sensors simultaneously, 24/7, without getting tired.

The model learns per machine. Pump 7A's normal is different from Pump 7B's normal, even if they're the same model. It adapts to load, speed, and ambient conditions. And it catches the kind of slow, multi-sensor degradation that thresholds simply can't.

The Explainability Gap

Here's where most AI solutions stumble. The model detects an anomaly and fires an alert: "Pump 7A — anomaly detected, confidence 94%."

Great. Now what?

Your reliability engineer gets that alert and asks the obvious question: why? Which sensor? What changed? Is this a bearing problem or a seal problem? Should I schedule a shutdown or just keep an eye on it?

If the answer is "the model said so," that alert goes in the trash. And rightfully so — no experienced engineer is going to shut down a production-critical pump based on a number from a black box.

This is where feature attribution changes the game. Prevly's models report exactly how much each input feature (each sensor reading, each calculated metric) contributed to the decision — which signals drove the alert, and by how much. (For the gradient-boosted RUL model that's SHAP; for the deep-learning anomaly and fault models it's Integrated Gradients — both produce the same kind of per-feature contribution breakdown.)
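For readers who want to see what a per-feature contribution breakdown involves, here is a minimal sketch of Integrated Gradients on a toy anomaly score. The logistic scoring function, its weights, the standardized input values, and the all-zeros baseline are hypothetical; only the three feature names are borrowed from this article. The method averages the score's gradient along a straight path from a baseline input to the actual input, then scales by how far each feature moved.

```python
import numpy as np

FEATURES = ["vibration_x_rms", "temperature_delta", "current_kurtosis"]
WEIGHTS = np.array([1.2, 0.8, 0.5])  # hypothetical model weights
BIAS = -2.0                          # hypothetical model bias


def anomaly_score(x):
    """Toy differentiable model: logistic score over standardized features."""
    return 1.0 / (1.0 + np.exp(-(x @ WEIGHTS + BIAS)))


def score_gradient(x):
    """Analytic gradient of the logistic score with respect to the inputs."""
    s = anomaly_score(x)
    return s * (1.0 - s) * WEIGHTS


def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann approximation of the Integrated Gradients integral."""
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.array([score_gradient(p) for p in path])
    return (x - baseline) * grads.mean(axis=0)


baseline = np.zeros(3)           # "typical healthy" point in standardized units
x = np.array([2.0, 1.5, 1.0])    # hypothetical reading during the alert
attributions = integrated_gradients(x, baseline)

for name, value in zip(FEATURES, attributions):
    print(f"{name}: {value:+.3f}")
print(f"sum of attributions: {attributions.sum():.3f}")
print(f"score change:        {anomaly_score(x) - anomaly_score(baseline):.3f}")
```

The attributions add up to the difference between the alert score and the baseline score, which is the completeness property that makes the breakdown interpretable as "how much each signal drove the alert".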

14 Days of Warning — With Receipts

Here's a concrete example from bearing outer race fault detection. The AI-based system flags an anomaly on a centrifugal pump 14 days before the bearing would have failed. The alert doesn't just say "anomaly" — it includes feature attribution:

  • vibration_x_rms: +0.34 — the dominant contributor, elevated vibration in the radial direction
  • temperature_delta: +0.21 — bearing temperature rising faster than housing temperature
  • current_kurtosis: +0.12 — subtle spikes in motor current, indicating intermittent mechanical resistance

The engineer reads this and immediately has a hypothesis: elevated radial vibration plus thermal rise plus current spikes — that's a classic outer race defect pattern. They schedule an inspection during the next planned downtime, confirm the defect with ultrasound, and swap the bearing during a 2-hour window instead of dealing with a catastrophic failure and 18 hours of unplanned downtime.

The threshold system? Still green. It would have stayed green for another 12 days.

From Reactive to Predictive

The shift from threshold alerts to AI-based anomaly detection isn't about replacing your monitoring infrastructure. Your sensors, historians, and SCADA systems stay exactly where they are. The difference is what sits on top: instead of static rules, you have a system that learns, adapts, and explains.

The transition doesn't happen overnight, and it doesn't have to. Most plants start by layering ML-based detection on their 10-20 most critical assets — the ones where unplanned downtime costs the most. The existing threshold alerts stay in place as a safety net. Within 2-4 weeks, the ML model learns the normal operating patterns for each machine. Within 2-3 months, you have enough data to compare detection rates: how many anomalies did the ML model catch that thresholds missed? In our experience, the answer is consistently 3-5x more detections with 60-80% fewer false alarms.

The economics are straightforward. A single prevented unplanned shutdown on a critical production line — caught 2 weeks early instead of at 3 AM — typically pays for a year of predictive monitoring across the entire facility. The question isn't whether AI-based detection works better than thresholds. The question is how many failures you're willing to miss while deciding.

For plant managers, this translates directly to numbers: fewer unplanned stops, lower spare parts inventory (because you know what's failing before it fails), and better maintenance scheduling. For reliability engineers, it means spending less time chasing false alarms and more time on the failures that actually matter — with the data to back up every decision.

See It on Your Own Data

Prevly brings AI-based anomaly detection with built-in feature-attribution explainability to industrial equipment — without requiring a data science team. Connect your sensor data, and within days you'll see what threshold alerts are missing.

Start your free trial at prevly.org and find out what your machines have been trying to tell you.

Related reading: How SHAP explains a prediction · RUL prediction explained · From sensors to predictions