Turning Wearable Noise Into Clinical Signal

How we normalize messy device data, run daily batch analytics in a Rails and MySQL stack, and keep insights explainable
TL;DR
Normalize and permission like a product. Batch before you stream. Pin what “yesterday” means. Plan for overwrites and deletes. Test mappers like payment code.
Connecting OAuth is the easy hour. The rest of the work is turning inconsistent device data into something that practitioners can actually interpret. In our case that meant building canonical daily metrics, percentile-based baselines, and multi-day rule-based insights — all while keeping the system explainable rather than turning it into a black box.
This system provides decision support and longitudinal context for care teams. It is not a diagnostic product and not a substitute for clinical judgment.
Wearables are easy to demo and hard to productionize
Consumer wearable experiences are designed around immediacy: charts update constantly; activity rings fill in throughout the day; scores change overnight. That works well for personal feedback, but when we started to explore how wearable data might fit into clinical workflows, a different question emerged:
What would actually help a practitioner notice something important?
Not another dashboard. Not another stream of metrics. Just a signal.
At Fullscript we didn’t set out to rebuild a wearable app. Instead, we focused on a narrower question:
How do you compress inconsistent, multi-source wearable data into a small set of readable signals that make sense in long-term care?
Answering that turned out to involve much less wearable technology and much more systems engineering.
The hackathon told the truth (and hid most of the work)
Our first internal wearable integration came together quickly during a company event. Within hours we had a working prototype: connect a device, pull some data, display metrics.
Technically speaking, the demo succeeded. But looking at it through a production lens quickly exposed the real work. We needed to figure out how to:
- Sync and backfill reliably
- Map vendor data to shared metric names and units in a stable way
- Keep operations sane at population scale
- Build interpretations that respect individual physiology
- Surface a small, scannable set of signals for practitioners
Calling an API was never the difficult part. The difficult part was turning chaos at the edge into stable rows and defensible labels.
Wearable data is a schema problem wearing a health problem costume
One of the first realizations we had was that wearable vendors don’t just disagree on APIs. They disagree on how health data should be represented.
Sleep illustrates this well; some vendors expose sleep stages, while others produce composite sleep scores. Resting heart rate may be calculated differently depending on the device and firmware. HRV (heart rate variability) measurements can vary depending on sampling window and algorithm.
Mobile platforms often add another layer of aggregation that improves access to the data, but it can blur provenance; the record may originate from an app we don’t control.
Rather than trying to standardize the edge, we assume the edge is messy.
Our system stores and analyzes data using a canonical vocabulary, while edge mappers handle aggregation and unit normalization. These mappings translate vendor outputs into a consistent internal structure.
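As a rough illustration, an edge mapper is just a pure function from one vendor's payload to canonical rows. The sketch below is hypothetical: class names, field names, and units are illustrative, not our production code.

```ruby
# Hypothetical sketch of an edge mapper: translate one vendor's daily sleep
# payload into canonical metrics. Field names and units are illustrative.
module Wearables
  module Mappers
    class OuraSleep
      MAPPER_VERSION = "2024.1"

      # vendor_record is a parsed API response for a single local day
      def self.call(vendor_record)
        provenance = { vendor: "oura", mapper: name, mapper_version: MAPPER_VERSION }

        [
          {
            metric: :sleep_duration,
            unit: :seconds,
            value: vendor_record.fetch("total_sleep_duration"),
            local_date: vendor_record.fetch("day"),
            provenance: provenance
          },
          {
            metric: :sleep_hrv,
            unit: :milliseconds,
            value: vendor_record.fetch("average_hrv"),
            local_date: vendor_record.fetch("day"),
            provenance: provenance
          }
        ]
      end
    end
  end
end
```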
This mapping layer is never finished. Firmware updates, SDK changes, and operating system updates continuously move the goalposts.
Because of that, we also persist the pathway that produced each row. When two sources claim the same person, metric, and local day, the system applies a deterministic resolution policy. Today we designate a primary vendor for baseline and early-stage processing so debugging and replays remain possible.
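As a sketch, a deterministic resolution policy can be as simple as an ordered vendor preference applied to the conflicting rows (the names below are illustrative):

```ruby
# Hypothetical sketch: when two sources report the same (person, metric, local day),
# keep the row from the highest-priority vendor instead of the most recent write.
VENDOR_PRIORITY = %w[oura apple_health].freeze

def resolve_conflict(rows)
  rows.min_by do |row|
    VENDOR_PRIORITY.index(row[:provenance][:vendor]) || VENDOR_PRIORITY.size
  end
end
```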
Acknowledging conflicts explicitly is far better than silently accepting whichever record was written last.
When the happy path ends (and it will)
Integrations always look complete until something breaks: OAuth tokens expire, vendor APIs introduce new rate limits, mobile uploads appear after you already considered a day “complete.” Production systems need to expect this.
We plan for partial syncs, allowing retry of specific slices without corrupting unrelated metrics. We absorb authentication churn without creating retry storms. We expect vendor algorithms to change and historical sessions to be corrected.
In practice that means idempotent keys and overwrites for the same logical day. Duplicate metrics arriving from multiple sources are handled through policy and provenance, not optimism.
Disconnects and deletions are also treated as first-class events. When a user disconnects a device, the system must stop ingesting data immediately, ensure queued work doesn’t resurrect it, and align retention with privacy commitments and jurisdictional requirements.
A surprising amount of wearable engineering turns out to be about state transitions rather than parsing JSON.
Consent follows the same philosophy. Device connections and data sharing are patient-controlled, and practitioner visibility follows the same permission standards used for other sensitive health information.
Two ways data gets in, one way it is stored
Regardless of source, everything converges into the same storage model.
There are two primary ingestion paths.
The first is server-side vendor sync. For example, when a user connects an Oura device, our backend pulls date ranges from the vendor API, maps the results into canonical daily metrics, and performs batched upserts keyed by person, vendor, metric, and local date.
The second path handles bulk uploads from the client. Devices integrated through mobile health frameworks send GraphQL batches from the mobile client. These uploads are validated against the same canonical metric set and written using the same storage logic.
All writes are batched upserts. This improves throughput and keeps MySQL healthy under load.
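In ActiveRecord terms, the write path is roughly an `upsert_all` keyed on those columns. This is a sketch; the model and column names are assumptions, and on MySQL the upsert relies on the table's unique index rather than an explicit conflict target.

```ruby
# Hypothetical sketch of the batched upsert. A unique index on
# (person_id, vendor, metric, local_date) means re-running a sync for the same
# logical day overwrites the existing row instead of duplicating it.
rows = mapped_metrics.map do |m|
  {
    person_id: person.id,
    vendor: m[:provenance][:vendor],
    metric: m[:metric].to_s,
    local_date: m[:local_date],
    value: m[:value],
    unit: m[:unit].to_s,
    provenance: m[:provenance].to_json
  }
end

DailyMetric.upsert_all(rows) if rows.any?
```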
For Oura specifically, a new connection pulls approximately ninety days of historical data. Longer gaps are chunked into the same window size so that jobs and APIs remain bounded.
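Chunking a long backfill into bounded windows might look something like this (a sketch; the ninety-day figure comes from above, while the helper and job names are hypothetical):

```ruby
# Hypothetical sketch: split a historical gap into ~90-day windows so each
# sync job and each vendor API call stays bounded.
WINDOW_DAYS = 90

def backfill_windows(from_date, to_date)
  windows = []
  cursor = from_date
  while cursor <= to_date
    window_end = [cursor + (WINDOW_DAYS - 1), to_date].min
    windows << (cursor..window_end)
    cursor = window_end + 1
  end
  windows
end

# backfill_windows(Date.new(2024, 1, 1), Date.new(2024, 7, 1)).each do |range|
#   SyncVendorWindowJob.perform_later(person.id, "oura", range.first, range.last)
# end
```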
It is a boring implementation detail, but boring details tend to produce stable systems.
Why we batch instead of streaming
Early on, we debated whether the system should process wearable data as a real-time stream. On paper that sounds attractive: continuous flows of physiological data feeding into analytics pipelines. But when we stepped back and looked at the clinical use case, the value of real-time processing became less clear.
A single night of poor sleep is rarely meaningful. A small spike in resting heart rate for one day may not matter. Patterns across several days or weeks are much more interesting. So, we chose a deliberately simple architecture: daily summaries as the unit of record.
Vendor APIs are polled. Daily metrics are normalized and stored. Batch analytics jobs run across connected users.
This approach keeps fan-out bounded and makes failures easier to reason about. When something goes wrong we know it happened during a specific batch window rather than somewhere inside a streaming pipeline.
Metrics are keyed on local calendar days. To prevent batch jobs from processing an incomplete day, we anchor “yesterday” to US Eastern time so that a UTC cron job does not split a day while it is still in progress.
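In Rails terms, that anchoring is a one-liner (a sketch; the method name is invented here for illustration):

```ruby
# Hypothetical sketch: compute "yesterday" in US Eastern time so a UTC cron
# schedule never processes a local day that is still in progress.
def batch_target_date(now = Time.current)
  now.in_time_zone("America/New_York").to_date - 1
end
```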
We did not optimize for sub-minute latency. Native wearable apps already provide intraday feedback. Our system focuses on longitudinal compression within the clinical workflow.
Three days on one patient
Consider a simple example.
Alex connects a wearable device on Monday. By the end of that day the system has daily rows for sleep duration, sleep heart rate, sleep HRV, and breathing rate, each with a timestamp and provenance.
On Tuesday the nightly batch run processes Monday’s data as the latest completed slice. Baselines reference Alex’s prior thirty days of personal history. That history is thin early in the connection, so behavior remains conservative.
Layer 1 labels Monday’s metrics using two perspectives simultaneously: population guardrails, and Alex-specific baseline percentiles. This allows the system to show both “generally normal” and “unusual for Alex lately.”
Wednesday introduces another data point. The personal baseline begins to stabilize as more history accumulates.
Later, a higher-level insight might fire not because of a single spike but because multiple signals align across several days inside a sliding window.
That difference — spike versus story — is where most of the signal emerges.
Baselines: your place in your own recent history
Global thresholds are appealing but rarely sufficient. For example, declaring that a resting heart rate above a specific value is “bad” ignores how much physiology varies across individuals. Instead of relying on fixed thresholds, we compare each day’s value against that person’s own recent history.
Most metrics use a thirty-day window. The value for the baseline date is ranked against that window and assigned a percentile. The system stores the percentile rank rather than a single “normal” value. This allows the system to represent where a measurement sits relative to the person’s own physiology.
Sparse windows degrade gracefully. A small amount of historical data produces conservative interpretations rather than aggressive conclusions. This approach avoids trying to define universal “normal” values and instead focuses on detecting deviations from an individual’s baseline.
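A minimal version of that percentile rank might look like the sketch below. The thirty-day window and graceful degradation are described above; the minimum-sample threshold and method name are assumptions.

```ruby
# Hypothetical sketch: rank a day's value against the person's own recent
# history and return a percentile, or nil when history is too sparse.
MIN_SAMPLES = 7 # assumption: below this, skip interpretation rather than guess

def baseline_percentile(value, history)
  return nil if history.size < MIN_SAMPLES

  rank = history.count { |v| v <= value }
  (rank.to_f / history.size * 100).round
end

# baseline_percentile(62, last_30_days_of_resting_hr) # => e.g. 85
```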
Layer 1: population and personal signals together
Practitioners need simple states that are easy to scan. Layer 1 produces these states by assigning two classifications for each metric on each day.
The first classification uses configured population bounds. The second classification uses the baseline percentile derived from the patient’s recent history.
Using both perspectives together reduces false reassurance from personal-only comparisons and reduces unnecessary alerts from population-only thresholds.
The result is a system that can say both “this looks fine generally” and “this is unusual for this person.”
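A sketch of that dual classification, with illustrative bounds and label names rather than our production implementation:

```ruby
# Hypothetical sketch of Layer 1: one population label from configured bounds,
# one personal label from the baseline percentile. Both travel together.
def layer1_classify(metric:, value:, percentile:, bounds:)
  population =
    if value < bounds[:low] then :below_range
    elsif value > bounds[:high] then :above_range
    else :in_range
    end

  personal =
    case percentile
    when nil      then :insufficient_history
    when 0..10    then :unusually_low_for_person
    when 90..100  then :unusually_high_for_person
    else :typical_for_person
    end

  { metric: metric, population: population, personal: personal }
end

# layer1_classify(metric: :resting_heart_rate, value: 58, percentile: 93,
#                 bounds: { low: 40, high: 80 })
# => { metric: :resting_heart_rate, population: :in_range,
#      personal: :unusually_high_for_person }
```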
Layer 2: composition with persistence
Physiology rarely changes in isolation. A slightly elevated sleeping heart rate may not be meaningful on its own. Neither is a modest drop in HRV. But when several signals move together across multiple days, patterns begin to emerge.
Layer 2 insights combine Layer 1 signals using rule-based composition. These rules specify required metric states and persistence requirements. Instead of reacting to a single abnormal day, the system looks for qualifying patterns across sliding windows.
For example, elevated breathing rate during sleep combined with higher heart rate and skin temperature shifts across several days may indicate that multiple physiological systems are under increased demand.
Variants of these rules are evaluated in priority order so that the most explanatory match wins. This composition step is more computationally expensive than per-metric classification, so it runs on a slower cadence. Layer 1 updates daily, while Layer 2 runs weekly.
That trade-off keeps the system fresh where it matters while keeping computational costs reasonable.
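As a rough sketch of how such a rule can be encoded: everything below (rule name, required states, window lengths) is illustrative rather than our production ruleset.

```ruby
# Hypothetical sketch of a Layer 2 rule: several Layer 1 personal states must
# co-occur on enough days inside a sliding window before the insight fires.
RULE = {
  name: :multi_system_strain,
  window_days: 7,
  min_matching_days: 3,
  required_states: {
    breathing_rate: :unusually_high_for_person,
    sleep_heart_rate: :unusually_high_for_person,
    skin_temperature: :unusually_high_for_person
  }
}.freeze

# layer1_by_day: { Date => { metric => personal_state } }
def rule_fires?(rule, layer1_by_day, as_of:)
  window = (as_of - (rule[:window_days] - 1))..as_of
  matching_days = window.count do |day|
    states = layer1_by_day[day] || {}
    rule[:required_states].all? { |metric, state| states[metric] == state }
  end
  matching_days >= rule[:min_matching_days]
end
```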
Keeping mapper drift boring
Vendor schemas evolve constantly. Without safeguards, small changes in vendor responses can silently alter metric interpretation. To prevent this, we treat mapping and aggregation logic with the same discipline used for payment processing or financial calculations.
Mapper functions have dedicated tests and fixtures. When vendor APIs change, the tests fail loudly rather than quietly producing incorrect results.
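In practice that looks like ordinary fixture-driven unit tests. A sketch, reusing the hypothetical mapper from earlier (fixture paths and class names are illustrative):

```ruby
# Hypothetical RSpec sketch: pin a recorded vendor payload as a fixture and
# assert the canonical rows it must produce. A silent vendor change breaks this.
RSpec.describe Wearables::Mappers::OuraSleep do
  let(:payload) { JSON.parse(File.read("spec/fixtures/oura/daily_sleep.json")) }

  it "maps total sleep duration to the canonical metric in seconds" do
    rows = described_class.call(payload)
    duration = rows.find { |r| r[:metric] == :sleep_duration }

    expect(duration[:unit]).to eq(:seconds)
    expect(duration[:value]).to eq(payload.fetch("total_sleep_duration"))
    expect(duration[:provenance][:vendor]).to eq("oura")
  end
end
```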
This work is unglamorous, but it is the difference between plausible dashboards and trustworthy systems.
Above the engine: attention, audiences, and non-goals
Deciding how insights are surfaced is primarily a product question — ranking, density, and interaction design. But the engineering layer still has obligations. Every surfaced insight must remain traceable. Practitioners should be able to see which metrics, which dates, and which rule variants produced a signal.
Patients and practitioners also receive different framing. The underlying signal remains the same, but language and emphasis differ depending on the audience. The analytics layer therefore stores structured insight payloads rather than hard-coded prose.
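Concretely, what gets stored is closer to structured data than prose, so audience-specific copy can be rendered later. A sketch with illustrative field names and values:

```ruby
# Hypothetical sketch of a stored insight payload: structured enough to trace
# the signal back to its metrics, dates, and rule variant, and to render
# different language for practitioners and patients.
insight = {
  rule: :multi_system_strain,
  rule_variant: "v2",                       # illustrative variant identifier
  window: { from: "2024-05-03", to: "2024-05-09" },
  contributing_metrics: %i[breathing_rate sleep_heart_rate skin_temperature],
  matching_days: %w[2024-05-05 2024-05-07 2024-05-08],
  audience_keys: {
    practitioner: "insights.multi_system_strain.clinical",
    patient: "insights.multi_system_strain.plain_language"
  }
}
```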
We also set deliberate non-goals. Real-time nudges and intraday coaching were deprioritized; wearable devices already perform those functions well. Our system complements them by providing longitudinal context within clinical workflows.
Closing the loop
The future will contain more sensors, more devices, and more health data per person. The engineering challenge is not collecting that data: the challenge is compressing it.
In our case, the architecture eventually settled into a relatively simple shape: heterogeneous device data → canonical daily metrics → personalized baseline percentiles → multi-day compositional insights
Each stage reduces noise while preserving enough traceability for practitioners to understand where the signal came from. Over time the system grows by adding mappings and data sources rather than adding increasingly complex application logic.
In the long run, that kind of compression may matter more than the sensors themselves.