From Sensors to Predictions: How a PdM Platform Actually Works
"We'll just put ML on the sensor data" is the handwave that launches a thousand failed PdM projects. The reality is that the ML model is perhaps 20% of the system. The other 80% is getting data reliably from sensors to models, and getting predictions reliably from models to humans.
Here's how the full pipeline works in a modern platform, from physical sensor to engineer's phone.
Layer 1: Edge Collection
Industrial sensors produce data at wildly different rates. A vibration accelerometer on a bearing samples at 12,800 Hz. A temperature probe updates every 30 seconds. A flow meter sends data every 5 seconds.
An edge gateway handles:
- Protocol translation — OPC-UA, MQTT, Modbus RTU, and analog 4-20mA signals all need different drivers
- Local buffering — Network outages shouldn't lose data. The gateway stores readings locally and syncs when connectivity returns
- Downsampling — 12.8 kHz vibration data is summarized (RMS, peak, kurtosis) for continuous monitoring, with raw waveforms captured periodically for spectral analysis
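To make the downsampling step concrete, here is a minimal sketch of reducing one block of raw samples to the three summary statistics named above. The function name and the synthetic 50 Hz test signal are illustrative assumptions, not part of any particular gateway's firmware.

```python
import math

def summarize_waveform(samples):
    """Reduce one block of raw vibration samples to RMS, peak, and kurtosis."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    mean = sum(samples) / n
    rms = math.sqrt(sum(x * x for x in samples) / n)
    peak = max(abs(x) for x in samples)
    # Second and fourth central moments (population form).
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    # Non-excess kurtosis: about 3 for Gaussian noise, higher for impulsive signals.
    kurtosis = m4 / (m2 * m2) if m2 > 0 else 0.0
    return {"rms": rms, "peak": peak, "kurtosis": kurtosis}

# One second of a synthetic 50 Hz sine sampled at 12,800 Hz (illustrative input).
fs = 12_800
wave = [math.sin(2 * math.pi * 50 * i / fs) for i in range(fs)]
summary = summarize_waveform(wave)
```

A pure sine has low kurtosis; the short, sharp impacts of early bearing damage push the value up, which is why kurtosis is worth keeping alongside RMS in the summary.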
The Protocol Challenge
Protocol translation sounds simple in a vendor slide deck. In practice, it's where most PdM deployments hit their first wall.
OPC-UA is the modern standard, but "standard" is generous: each equipment vendor implements it differently. You'll spend days configuring security certificates (OPC-UA's secure connection modes rely on X.509 certificate exchange), mapping node IDs to meaningful tag names, and dealing with servers that implement only a subset of the spec. Newer equipment is generally compliant; anything manufactured before 2015 is a coin flip.
Modbus RTU/TCP is reliable but crude. Register mapping is entirely manual — the sensor documentation tells you "register 40001 is vibration X-axis, 32-bit float, big-endian," and you translate that into configuration. Multiply by 50-200 sensors per facility, and you understand why protocol mapping alone can consume a full week of commissioning.
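As a rough illustration of what one register-map entry turns into, the sketch below decodes a big-endian 32-bit float from raw register values. A 32-bit value occupies two consecutive 16-bit registers, and the sketch assumes the high word comes first; some devices swap the word order, so check the vendor documentation. The `REGISTER_MAP` layout and function name are hypothetical.

```python
import struct

def registers_to_float32(high_word, low_word):
    """Combine two 16-bit registers into one big-endian IEEE 754 float."""
    raw = struct.pack(">HH", high_word, low_word)
    return struct.unpack(">f", raw)[0]

# Hypothetical map entry; the real one is transcribed from sensor documentation.
REGISTER_MAP = {
    "vibration_x": {"address": 40001, "count": 2, "decode": registers_to_float32},
}

# 0x3FE6 and 0x6666 together are the float32 encoding of roughly 1.8.
value = registers_to_float32(0x3FE6, 0x6666)
```

Each sensor needs an entry like this, which is where the commissioning time goes.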
4-20mA analog signals require additional hardware: signal conditioners, analog-to-digital converters, and careful grounding to avoid noise. A 0.1 mA offset on a 4-20mA current loop translates directly to a measurement error that your ML model will interpret as a process change.
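The arithmetic behind that offset is easy to check. The sketch below assumes a linear transmitter and a hypothetical 0 to 150 °C range; the function name is illustrative.

```python
def current_to_engineering(current_ma, range_min, range_max):
    """Linearly map a 4-20 mA loop current onto the sensor's measurement range."""
    span_fraction = (current_ma - 4.0) / 16.0  # 4 mA is 0% of span, 20 mA is 100%
    return range_min + span_fraction * (range_max - range_min)

# Hypothetical temperature transmitter ranged 0 to 150 °C.
true_reading = current_to_engineering(12.0, 0.0, 150.0)    # mid-scale
offset_reading = current_to_engineering(12.1, 0.0, 150.0)  # same signal plus 0.1 mA
error = offset_reading - true_reading
```

A 0.1 mA offset is 0.1/16 of the span, or 0.625%, so on this example range it shows up as a shift of just under 1 °C.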
Edge hardware requirements are often underestimated. A gateway handling 200 sensors needs sufficient compute for local buffering (minimum 32 GB storage for 72-hour offline capability), protocol translation (CPU-bound for OPC-UA encryption), and optional edge inference. Industrial-grade gateways, such as those from Advantech and Moxa or the Siemens IOT2050, typically range from €500 to €2,000 per unit, with ARM or x86 processors and industrial temperature ratings (-40 to 70°C).
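The local buffering behavior can be pictured as a bounded store-and-forward queue. This is a minimal in-memory sketch under stated assumptions: a real gateway would persist readings to disk, and the class name and the `send` callback are hypothetical.

```python
from collections import deque

class StoreAndForwardBuffer:
    """Hold readings while offline and replay them oldest-first on reconnect."""

    def __init__(self, max_readings):
        # When the buffer is full, the oldest reading is dropped to make room.
        self._queue = deque(maxlen=max_readings)

    def add(self, reading):
        self._queue.append(reading)

    def __len__(self):
        return len(self._queue)

    def flush(self, send):
        """Call send(reading) for each buffered reading; stop at the first failure."""
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break  # still offline, keep the reading for the next attempt
            self._queue.popleft()
            sent += 1
        return sent
```

A reading is only removed after `send` reports success, so a connection that drops mid-sync does not lose data.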
Layer 2: Streaming Ingestion
Data leaves the edge as MQTT messages and enters a message broker (typically Apache Kafka) that provides:
- Durability — Messages persist until consumed, surviving service restarts
- Multi-consumer — The same data feeds real-time processing, cold storage, and ML pipelines simultaneously
- Schema enforcement — Avro schemas validate data structure before it enters the pipeline
At this stage, data is validated: sensor IDs are verified, timestamps are checked for drift, and readings outside physical bounds are flagged. A vibration reading of -500 mm/s or a temperature of 3,000°C gets quarantined, not passed to ML models. Quality scoring (0-100) tags each reading, and anything below 50 is filtered before it reaches the feature engineering stage.
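A minimal sketch of that routing decision is shown below. The bounds table, the `Reading` fields, and the three outcome labels are assumptions for illustration; only the quality cutoff of 50 comes from the description above.

```python
from dataclasses import dataclass

# Hypothetical physical bounds per measurement type.
PHYSICAL_BOUNDS = {
    "vibration_mm_s": (0.0, 100.0),
    "temperature_c": (-50.0, 500.0),
}
MIN_QUALITY = 50  # readings scoring below this never reach feature engineering

@dataclass
class Reading:
    sensor_id: str
    kind: str
    value: float
    quality: int  # 0-100 quality score

def route(reading, known_sensors):
    """Return 'quarantine', 'filtered', or 'accept' for one reading."""
    if reading.sensor_id not in known_sensors:
        return "quarantine"
    bounds = PHYSICAL_BOUNDS.get(reading.kind)
    if bounds is None or not bounds[0] <= reading.value <= bounds[1]:
        return "quarantine"
    if reading.quality < MIN_QUALITY:
        return "filtered"
    return "accept"
```

Keeping quarantined readings separate from filtered ones lets you tell a failing sensor from a merely noisy one later.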
Layer 3: Stream Processing
A stream processor (Apache Flink) transforms raw readings into ML-ready features in real time:
- Rolling statistics — Mean, standard deviation, RMS, kurtosis over 60s, 5min, and 30min windows
- Trend detection — Is vibration increasing, stable, or decreasing over the last hour?
- Cross-sensor correlation — Temperature rising while vibration is stable suggests ambient change, not degradation
- Frequency features — FFT on vibration data extracts bearing frequencies, gear mesh frequencies, and harmonics
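Trend detection can be as simple as the sign of a fitted slope. The sketch below uses an ordinary least-squares line over evenly spaced samples; the `tolerance` parameter and the function name are assumptions, and a production job may well use something more robust.

```python
def classify_trend(values, tolerance):
    """Label a series 'increasing', 'decreasing', or 'stable' from its fitted slope."""
    n = len(values)
    if n < 2:
        return "stable"
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    # Least-squares slope with sample index as the x axis.
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    slope = numerator / denominator
    if slope > tolerance:
        return "increasing"
    if slope < -tolerance:
        return "decreasing"
    return "stable"
```

Running the same function on a temperature series and a vibration series gives the two labels needed for the cross-sensor comparison in the list above.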
Feature Engineering in Detail
The quality of ML predictions is determined here, not in the model architecture. A few concrete examples:
Rolling RMS vs. peak-to-peak: RMS vibration captures overall energy and is good for detecting imbalance or misalignment. Peak-to-peak captures transient impacts and is better for early bearing damage detection (where metal-to-metal contact creates short, sharp spikes). Computing both at multiple window sizes (60s, 300s, 1800s) gives the model different views of the same degradation process.
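Here is a small sketch of computing both features at the three window sizes. It assumes one reading per second, so window length in samples equals window length in seconds; the class and function names are hypothetical, and a real stream processor would use its own windowing primitives.

```python
import math
from collections import deque

class RollingWindow:
    """Fixed-length window over evenly spaced samples."""

    def __init__(self, size):
        self._values = deque(maxlen=size)

    def push(self, value):
        self._values.append(value)

    def rms(self):
        return math.sqrt(sum(v * v for v in self._values) / len(self._values))

    def peak_to_peak(self):
        return max(self._values) - min(self._values)

# Assumes 1 Hz input, so these sizes match the 60 s, 300 s, and 1800 s windows.
windows = {size: RollingWindow(size) for size in (60, 300, 1800)}

def update(value):
    """Push one reading into every window and return the current feature set."""
    features = {}
    for size, window in windows.items():
        window.push(value)
        features[f"rms_{size}s"] = window.rms()
        features[f"p2p_{size}s"] = window.peak_to_peak()
    return features

# Steady 1.0 mm/s for five minutes, then a single 5.0 mm/s impact.
for _ in range(300):
    update(1.0)
features = update(5.0)
```

After the impact, peak-to-peak jumps to 4.0 in every window, while RMS moves noticeably only in the 60-second window. The transient that peak-to-peak flags immediately is almost averaged away by RMS over the longer windows.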
FFT and bearing defect frequencies: For vibration-based bearing monitoring, raw FFT isn't enough — you need to know what to look for. Every rolling element bearing has four characteristic defect frequencies, all derived from bearing geometry and shaft speed:
- BPFO (Ball Pass Frequency Outer), outer race defect: (N/2) × RPM × (1 - (d/D) × cos(α))
- BPFI (Ball Pass Frequency Inner), inner race defect: (N/2) × RPM × (1 + (d/D) × cos(α))
- BSF (Ball Spin Frequency), rolling element defect: (D/(2d)) × RPM × (1 - ((d/D) × cos(α))²)
- FTF (Fundamental Train Frequency), cage defect: (RPM/2) × (1 - (d/D) × cos(α))
Where N = number of rolling elements, d = ball diameter, D = pitch diameter, α = contact angle, and RPM = shaft speed. With shaft speed in RPM, the results are in cycles per minute; divide by 60 to get Hz. The stream processor computes FFT on raw vibration waveforms and extracts amplitude at these specific frequencies (plus 2x and 3x harmonics). An increasing amplitude at BPFO with harmonics is a textbook outer race defect, the kind of pattern that's invisible in time-domain RMS but unmistakable in the frequency domain.
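The four formulas translate directly into code. In this sketch the shaft speed is converted to Hz first, so the results line up with FFT bins; the function name and the example geometry are illustrative assumptions, not values from a specific datasheet.

```python
import math

def bearing_defect_frequencies(rpm, n_elements, ball_d, pitch_d, contact_angle_deg=0.0):
    """Return the four characteristic defect frequencies in Hz."""
    shaft_hz = rpm / 60.0
    # (d/D) * cos(alpha); ball_d and pitch_d must share the same length unit.
    ratio = (ball_d / pitch_d) * math.cos(math.radians(contact_angle_deg))
    return {
        "BPFO": (n_elements / 2) * shaft_hz * (1 - ratio),
        "BPFI": (n_elements / 2) * shaft_hz * (1 + ratio),
        "BSF": (pitch_d / (2 * ball_d)) * shaft_hz * (1 - ratio ** 2),
        "FTF": (shaft_hz / 2) * (1 - ratio),
    }

def with_harmonics(base_hz, orders=(1, 2, 3)):
    """Frequencies at which to read FFT amplitude: the fundamental plus 2x and 3x."""
    return [base_hz * k for k in orders]

# Illustrative geometry: 9 balls, 7.94 mm ball, 39.04 mm pitch diameter, 1800 RPM.
freqs = bearing_defect_frequencies(1800, 9, 7.94, 39.04)
bpfo_bins = with_harmonics(freqs["BPFO"])
```

Two quick sanity checks follow from the formulas themselves: BPFO plus BPFI equals the number of rolling elements times the shaft frequency, and BPFO equals the number of rolling elements times FTF.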
Window size selection matters more than most teams realize. A 60-second window captures fast transients (motor startup, load changes). A 30-minute window smooths those out and reveals slow trends. Using only one window size forces the model to learn both fast and slow dynamics from a single representation — providing both gives it the temporal resolution to distinguish "the motor just started up" from "this bearing is getting worse."
Layer 4: Storage
Processed data lands in a time-series database (TimescaleDB) optimized for:
- Hypertables — Automatic partitioning by time for fast range queries
- Continuous aggregates — Pre-computed rollups (1min, 5min, 1hour) for dashboard performance
- Retention policies — Hot data (90 days) in PostgreSQL, cold data exported to columnar storage for long-term analytics
Multi-tenancy is enforced at the database level with row-level security (RLS). Every query, whether from the API, a dashboard, or an internal service, is automatically scoped to the authenticated tenant. This isn't application-level filtering that a bug could bypass; it's a database-level constraint that blocks cross-tenant access for any role that isn't a superuser or otherwise exempt from RLS (in PostgreSQL, the table owner or a role with BYPASSRLS).
Layer 5: ML Inference
Three model types per asset:
Anomaly Detection
An LSTM autoencoder learns the normal operating pattern for each asset. A high reconstruction error means current behavior doesn't match the learned patterns. The model processes a sliding window of multivariate sensor data (typically 60 timesteps) and outputs a reconstruction error score. When this score exceeds a learned threshold (calibrated per-asset from healthy operating data), an anomaly is flagged.
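One common way to calibrate such a threshold is mean plus k standard deviations of the reconstruction errors seen on healthy data. That rule, the value of `k`, and the sample errors below are assumptions for illustration; the platform's actual calibration method isn't specified here.

```python
import statistics

def calibrate_threshold(healthy_errors, k=3.0):
    """Per-asset threshold: mean + k population standard deviations of healthy errors."""
    mean = statistics.fmean(healthy_errors)
    std = statistics.pstdev(healthy_errors)
    return mean + k * std

def is_anomalous(reconstruction_error, threshold):
    return reconstruction_error > threshold

# Hypothetical reconstruction errors from a healthy baseline period.
healthy = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.11, 0.12]
threshold = calibrate_threshold(healthy)
```

Because the threshold comes from each asset's own baseline, a noisy but healthy machine gets a wider tolerance than a quiet one.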
Remaining Useful Life (RUL)
LSTM models estimate the time remaining until the next maintenance event. An LSTM processing raw 14-sensor input achieved an RMSE of 11.48 on the NASA C-MAPSS benchmark. C-MAPSS expresses remaining life in operating cycles rather than calendar days, so that figure means a typical error of roughly 11 cycles. For practical maintenance planning on equipment where a cycle is roughly a day of operation, that's the difference between "schedule it this week" and "schedule it this month."
Fault Diagnostics
A 1D-CNN with attention classifies vibration waveforms into fault categories: inner race, outer race, ball defect, cage defect, misalignment, imbalance. The model is pre-trained on the CWRU bearing dataset and fine-tuned on plant-specific data.
Training Data Requirements and Accuracy Expectations
Model performance depends heavily on data volume:
| Data Volume | Best Model | Expected Accuracy |
|---|---|---|
| < 1,000 samples | Isolation Forest | Detects gross anomalies, no RUL |
| 1,000 - 50,000 | LSTM Autoencoder | Good anomaly detection, basic RUL |
| 50,000 - 200,000 | LSTM + LightGBM | Strong anomaly + RUL predictions |
| > 200,000 + GPU | TranAD (Transformer) | State-of-the-art anomaly + RUL |
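Read as a decision rule, the table looks roughly like the sketch below. The boundary handling (upper limits treated as inclusive) and the fallback to LSTM + LightGBM when no GPU is available are assumptions, since the table doesn't spell them out; the function name is hypothetical.

```python
def recommend_model(sample_count, has_gpu=False):
    """Pick a model family from the amount of training data available."""
    if sample_count < 1_000:
        return "Isolation Forest"
    if sample_count <= 50_000:
        return "LSTM Autoencoder"
    if sample_count <= 200_000 or not has_gpu:
        # Assumption: without a GPU, stay on the lighter model even with more data.
        return "LSTM + LightGBM"
    return "TranAD (Transformer)"
```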
For a new deployment with zero historical data, pre-trained models (trained on the CWRU bearing dataset and NASA C-MAPSS turbofan data) provide immediate baseline capability. These aren't perfect for your specific equipment, but they encode general degradation physics that transfer surprisingly well across similar equipment classes.
Layer 6: Alert Engine
The alert engine applies business rules:
- Severity mapping — Anomaly score thresholds map to severity levels. A score 2x above threshold is WARNING; 4x is CRITICAL. These multipliers are configurable per asset class, because a "critical" anomaly on a backup pump has different operational implications than the same score on a single-point-of-failure compressor.
- Deduplication — A degrading bearing generates continuous anomaly scores above threshold. Without deduplication, you'd get an alert every inference cycle (typically every 30-60 seconds). The alert engine groups related anomalies by asset and failure mode, sending a single alert with updates rather than a flood of repetitive notifications.
- Escalation — WARNING not acknowledged within a configurable window (default: 4 hours)? Escalate to CRITICAL and notify the next level. CRITICAL not acknowledged in 1 hour? Page the on-call manager.
- Feature attribution — Every alert shows which sensors contributed and by how much. This isn't optional or a premium feature — it's the difference between an alert that gets investigated and one that gets dismissed.
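The first three rules can be sketched in a few dozen lines. This is a simplified illustration, not the platform's implementation: it reads "2x above threshold" as a score of at least twice the threshold, keeps open alerts in memory, and uses hypothetical names throughout.

```python
from dataclasses import dataclass, field

# Default multipliers from the rules above; other asset classes can override them.
SEVERITY_MULTIPLIERS = {"default": {"WARNING": 2.0, "CRITICAL": 4.0}}

def severity_for(score, threshold, asset_class="default"):
    """Map an anomaly score to a severity level, or None if below WARNING."""
    multipliers = SEVERITY_MULTIPLIERS.get(asset_class, SEVERITY_MULTIPLIERS["default"])
    ratio = score / threshold
    if ratio >= multipliers["CRITICAL"]:
        return "CRITICAL"
    if ratio >= multipliers["WARNING"]:
        return "WARNING"
    return None

@dataclass
class AlertEngine:
    open_alerts: dict = field(default_factory=dict)

    def process(self, asset_id, failure_mode, score, threshold):
        """Return 'new', 'updated', or None for one inference result."""
        severity = severity_for(score, threshold)
        if severity is None:
            return None
        key = (asset_id, failure_mode)  # deduplication key
        alert = self.open_alerts.get(key)
        if alert is None:
            self.open_alerts[key] = {"severity": severity, "updates": 0}
            return "new"
        alert["updates"] += 1
        if severity == "CRITICAL":
            alert["severity"] = "CRITICAL"  # severity only ever goes up
        return "updated"

def escalation_action(severity, hours_unacknowledged,
                      warning_window_h=4.0, critical_window_h=1.0):
    """Decide what to do with an alert nobody has acknowledged."""
    if severity == "WARNING" and hours_unacknowledged >= warning_window_h:
        return "escalate_to_critical"
    if severity == "CRITICAL" and hours_unacknowledged >= critical_window_h:
        return "page_on_call_manager"
    return None
```

With the asset and failure mode as the key, a bearing that stays above threshold for hours produces one alert with a growing update count instead of a new notification every inference cycle.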
Integration Patterns
Alerts need to reach people where they work. The alert engine supports multiple delivery channels simultaneously:
- PagerDuty / OpsGenie — for on-call rotation and escalation
- ServiceNow / SAP PM / Maximo — auto-created work orders with diagnostic context
- Webhooks — for custom integrations with internal systems
- Email / SMS — for teams not using incident management platforms
- Mobile push — for operators on the plant floor
The Cold Start Problem
The most common question from new PdM deployments: "What happens on day one with zero data?"
This is where most in-house PdM projects stall. Training an LSTM autoencoder from scratch requires weeks of clean operating data. Training a RUL model requires historical failure data — which you might not have in structured form.
A modern platform addresses cold start in three phases:
Day 1-7: Pre-trained models. Models trained on public benchmark datasets (CWRU bearings, NASA C-MAPSS turbofan engines) and cross-customer anonymized data provide immediate anomaly detection. They won't catch equipment-specific failure modes, but they'll catch common degradation patterns (bearing wear, imbalance, thermal runaway) that account for 60-70% of rotating equipment failures.
Week 2-4: Baseline learning. With 2-4 weeks of continuous data, the platform trains asset-specific models. The LSTM autoencoder learns what "normal" looks like for each machine under its specific operating conditions. Anomaly detection accuracy improves significantly because the model now knows that Pump 7A at 1,800 RPM normally vibrates at 1.8 mm/s — not just that "pumps vibrate between 0 and 10 mm/s."
Month 2+: Progressive ML. As data accumulates, the platform automatically upgrades models: Isolation Forest → LSTM Autoencoder → TranAD (Transformer-based). Each upgrade improves both detection sensitivity and lead time. The transition is automatic — the platform tracks each model's performance metrics and promotes the better model when it proves itself on held-out validation data.
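The promotion logic can be sketched as a ladder with a gate. The validation metric (any higher-is-better score on held-out data), the `min_gain` margin, and the function name are assumptions; the text above only says that the better model is promoted once it proves itself.

```python
MODEL_LADDER = ["Isolation Forest", "LSTM Autoencoder", "TranAD"]

def next_model(active, validation_scores, min_gain=0.02):
    """Return the model to run next, promoting one rung at a time.

    validation_scores maps model name to a higher-is-better score on held-out data.
    """
    index = MODEL_LADDER.index(active)
    if index + 1 >= len(MODEL_LADDER):
        return active  # already at the top of the ladder
    candidate = MODEL_LADDER[index + 1]
    if candidate not in validation_scores or active not in validation_scores:
        return active  # nothing to compare yet
    if validation_scores[candidate] - validation_scores[active] >= min_gain:
        return candidate
    return active
```

Requiring a margin rather than any improvement at all keeps the platform from flipping between models on validation noise.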
Layer 7: Action
Alerts trigger: PagerDuty/ServiceNow notifications, auto-created work orders, webhooks for custom integrations, and real-time dashboard visualization. The final layer closes the loop: predicted failure → alert → investigation → work order → repair → confirmation. Every step is logged, creating the audit trail that feeds back into model retraining and ROI measurement.
Prevly handles all seven layers as a managed platform. See how it works with an interactive demo using real ML models.