Time-series language models are landing in hospitals, in wearables, and on factory floors, and the teams shipping them have no reliable way to know what those models actually do. AHRI maps the capability gaps and retrains the models until those gaps close, so the models you ship can be trusted with the work.
Time-series language models are already reading ECGs, scoring sleep, recognising activity from wearables — and the applied results are genuinely good. But a strong score on a handful of real-world tasks doesn’t tell you what a model actually learned. AHRI breaks signal understanding down into its component skills and tests each one on its own, so you can see where a model is sharp and where it’s just guessing.
# One signal. One question. One skill.
[SIGNAL TOKENS]
Question: Does this signal have an upward trend, a downward trend, or no trend?
Answer:
# No statistics in the text.
# The answer must come from the signal.
A real ECG question pulls on signal reading, medical knowledge, and a bit of guesswork all at once. AHRI’s tasks isolate one skill apiece — so a score measures exactly that, and nothing else.
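As a rough illustration of what a single-skill task could look like, here is a minimal sketch of a synthetic trend probe. The function name `make_trend_probe`, the slope, the noise level, and the signal length are all assumptions made for this example; AHRI's actual task generators are not described here. The point it shows is that the label is fixed by how the signal is built, and the prompt text carries no numbers that could give the answer away.

```python
import random

LABELS = ("upward", "downward", "none")

def make_trend_probe(n=128, seed=0):
    """Build one synthetic trend question whose answer is known by construction.

    Hypothetical generator: slope and noise values are illustrative only.
    """
    rng = random.Random(seed)
    label = rng.choice(LABELS)
    slope = {"upward": 0.05, "downward": -0.05, "none": 0.0}[label]
    # Linear trend plus Gaussian noise. The label comes from the slope we chose,
    # not from anything written in the prompt.
    signal = [slope * t + rng.gauss(0.0, 0.5) for t in range(n)]
    prompt = (
        "[SIGNAL TOKENS]\n"
        "Question: Does this signal have an upward trend, "
        "a downward trend, or no trend?\n"
        "Answer:"
    )
    return {"signal": signal, "prompt": prompt, "answer": label}

probe = make_trend_probe(seed=1)
```

Because the prompt is identical for every label, a model can only score above chance on a probe like this by reading the signal itself.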
Run every model through 22 controlled capability probes — each one isolating a single skill.
Get a per-task score map: which skills are solid, which are fragile, which never showed up.
The tasks a model trips on are exactly the ones worth drilling — comparing amplitudes, say, or spotting when several features overlap at once. AHRI stacks those weak skills, in ascending difficulty, into a focused training regime: the Ascending Harmonic Reasoning Instruction the project is named for.
Score architectures and pretraining strategies head-to-head against the same fixed yardstick.
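The per-task score map described above could be sketched as follows. This is a minimal sketch under stated assumptions: the function name `score_map`, the `solid` and `chance` thresholds, the task names, and the pass/fail outcomes are all invented for illustration and are not AHRI results or AHRI's scoring rules.

```python
def score_map(results, solid=0.9, chance=0.4):
    """Turn per-task pass/fail outcomes into (accuracy, status) pairs.

    results: {task_name: list of bools}, one bool per probe item.
    solid, chance: hypothetical cut-offs chosen for this example.
    """
    out = {}
    for task, outcomes in results.items():
        acc = sum(outcomes) / len(outcomes)
        if acc >= solid:
            status = "solid"
        elif acc > chance:
            status = "fragile"
        else:
            status = "absent"  # at or below the chance cut-off: never showed up
        out[task] = (acc, status)
    return out

# Toy outcomes, made up purely to exercise the function.
toy_results = {
    "trend_direction": [True] * 19 + [False] * 1,
    "amplitude_comparison": [True] * 12 + [False] * 8,
    "overlapping_features": [True] * 6 + [False] * 14,
}
toy_map = score_map(toy_results)
```

In practice the chance cut-off would depend on how many answer options each task offers, so a single global threshold is a simplification.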
Each level is strictly harder than the one below it — from noticing something is there, up through measuring it, comparing it, tracking how it changes, and finally combining everything at once. The harder levels are where the gaps tend to hide.
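One way to picture how weak skills might be stacked in ascending difficulty is the sketch below. It assumes each task carries an integer level from 1 (noticing) to 5 (combining), matching the order of the levels described above; the level numbers, task names, accuracies, and the `build_curriculum` helper are hypothetical, and the real training regime may be organised differently.

```python
def build_curriculum(task_levels, accuracies, threshold=0.9):
    """Return the tasks scoring below `threshold`, easiest level first.

    task_levels: {task_name: int level, 1 = easiest} (assumed numbering).
    accuracies:  {task_name: accuracy in [0, 1]}.
    """
    weak = [task for task, acc in accuracies.items() if acc < threshold]
    # Sort by level, then by name so ties are ordered deterministically.
    return sorted(weak, key=lambda task: (task_levels[task], task))

# Illustrative values only, not measured results.
toy_levels = {
    "presence": 1,
    "amplitude_comparison": 3,
    "trend_change": 4,
    "overlapping_features": 5,
}
toy_accuracies = {
    "presence": 0.97,
    "amplitude_comparison": 0.60,
    "trend_change": 0.85,
    "overlapping_features": 0.30,
}
curriculum = build_curriculum(toy_levels, toy_accuracies)
```

Here the already-solid level 1 task is skipped, and the remaining tasks are queued from level 3 up to level 5.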
79.7% in-distribution · 65.8% out-of-distribution over ten epochs on a small Time Series Language Model. Every task in the framework gets one of these.
Three questions drive the work: which skills a model genuinely learns rather than fakes, when those skills appear during training, and — with the tests held fixed as a yardstick — which architecture learns them best.
A model can answer an ECG question by reading the signal — or by pattern-matching the words around it. AHRI’s controlled tasks tell the two apart, so a skill only counts when the model actually earned it.
Skills don’t fade in gradually. They tend to snap on partway through training. Knowing when — and in what order — is what makes training them deliberately possible.
With the tests held fixed as a yardstick, the next step swaps the model itself — comparing time-series encoder architectures to find the one that learns fastest and generalises furthest.
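The question of when a skill snaps on, and in what order, can be made concrete with a small sketch. It assumes one accuracy value per epoch for each task and defines onset as the first epoch from which accuracy stays at or above a threshold; that definition, the threshold of 0.8, the helper names, and the toy curves are all assumptions for illustration rather than AHRI's method or data.

```python
def emergence_epoch(acc_by_epoch, threshold=0.8):
    """First 1-indexed epoch from which accuracy stays >= threshold, else None."""
    onset = None
    for epoch, acc in enumerate(acc_by_epoch, start=1):
        if acc >= threshold:
            if onset is None:
                onset = epoch
        else:
            onset = None  # dipped back below the threshold: reset
    return onset

def emergence_order(curves, threshold=0.8):
    """Tasks that emerged, sorted by onset epoch (ties broken by name)."""
    onsets = {task: emergence_epoch(c, threshold) for task, c in curves.items()}
    learned = [task for task, onset in onsets.items() if onset is not None]
    return sorted(learned, key=lambda task: (onsets[task], task))

# Toy learning curves, invented for this example.
toy_curves = {
    "trend_direction": [0.34, 0.35, 0.82, 0.88, 0.90],
    "amplitude_comparison": [0.33, 0.34, 0.33, 0.36, 0.81],
    "overlapping_features": [0.30, 0.31, 0.29, 0.32, 0.33],
}
order = emergence_order(toy_curves)
```

Requiring accuracy to stay above the threshold avoids counting a single lucky epoch as emergence, at the cost of reporting onset later for noisy curves.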
Time-series language models are landing in hospitals, in wearables, and on factory floors faster than anyone can establish what they actually do. AHRI tests them capability by capability, surfaces the gaps, and retrains until those gaps close. That’s the path from impressive demo to production-ready.
Built for teams shipping TSLMs in clinical, wearable, and industrial settings.
Get in touch by email: tony@ahriai.com