Evaluating large language models within a multi-agent system using real-world data from opioid treatment programs
Chronic pain and opioid use disorder often co-occur, fluctuate over time, and are shaped by factors that are difficult to capture in a single clinical visit. Sleep quality, stress, physical activity, medication adherence, and prior pain trajectories all help shape how these disorders manifest and are treated.
The APT Foundation is a multi-site treatment provider delivering care to over 7,000 patients with opioid use disorder and chronic pain annually. Like many behavioral health providers, APT faces challenges in monitoring pain changes that occur outside the clinic, between visits and throughout daily life, where factors like stress, sleep, activity, and medication adherence shape outcomes. This matters because treatment decisions are often made without a clear view of these changes, making it harder to intervene early and adjust care appropriately.
Wearable devices and patient-reported surveys make it possible to observe these signals continuously. The remaining challenge is not data availability, but how to reason over heterogeneous, longitudinal data in a way that aligns with clinical definitions and supports timely intervention.
As part of Nimblemind’s collaboration with the APT Foundation, we evaluated the role of large language models (LLMs) within a broader multi-agent system (MAS) designed to support chronic pain monitoring and intervention.
Why LLMs Were Evaluated for Chronic Pain Management
Chronic pain is inherently dynamic and patient-specific. What constitutes a clinically meaningful change for one patient may represent normal variation for another. Baseline levels, prior trajectories, and contextual factors such as stress, sleep, and medication timing must be taken into account.
LLMs offer capabilities that are difficult to encode in static rules or single-pass models, including:
Reasoning over longitudinal summaries rather than isolated values
Integrating heterogeneous inputs across clinical records, surveys, and wearables
Producing structured, inspectable explanations rather than opaque scores
This motivated our central research question: Can LLMs generate structured, baseline-aware clinical judgments within a system designed to monitor pain dynamics and support timely clinical intervention?
LLMs Within a Multi-Agent System
In this work, LLMs were evaluated as one component of a MAS, rather than as isolated decision-making models.
Within the MAS:
Specialized agents handled data ingestion, normalization, and feature construction across data modalities
Statistical and rule-based components generated candidate signals and thresholds (e.g., patient-specific pain baselines)
LLMs were tasked with reasoning over structured daily summaries, interpreting changes over time, and assessing whether observed patterns aligned with the study’s definition of a pain spike
This architecture allowed LLMs to focus on interpretation and contextual reasoning, while other agents enforced definitions, thresholds, and data integrity.
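As an illustration of the statistical layer described above, the sketch below shows one way a rule-based agent might flag candidate pain spikes against a patient-specific baseline before handing the day to the LLM for interpretation. The threshold rule, the `k` multiplier, and the minimum-history requirement are illustrative assumptions, not the study's actual spike definition.

```python
from statistics import mean, stdev

def spike_candidate(history, today, k=1.5, min_days=7):
    """Flag a candidate pain spike when today's score exceeds the
    patient's own historical baseline by k standard deviations.
    Parameter values are illustrative, not the study's definition."""
    if len(history) < min_days:
        return None  # too little data to form a per-patient baseline
    baseline = mean(history)
    spread = stdev(history) or 1.0  # guard against flat histories
    return today > baseline + k * spread
```

Because this agent, not the LLM, owns the threshold, the LLM's task narrows to explaining whether the flagged pattern is clinically coherent in context.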
A Real-World Experiment
The APT Foundation provided real-world, longitudinal data from patients actively receiving outpatient treatment for opioid use disorder and chronic pain.
The research integrated three primary data modalities:
Electronic medical records: Demographics, medication regimen, dose, timing, and adherence
Patient-reported surveys: Pain severity, perceived stress, sleep quality, and mental health indicators
Wearable devices (Fitbit): Continuous signals on activity levels, sleep stages, and heart-rate-derived stress measures
This data was aggregated into patient-day summaries that reflected how pain, behavior, and physiology evolved over time, mirroring the information clinicians would need when deciding whether to intervene. This setting made APT Foundation a meaningful environment for evaluating whether LLMs could reason about pain dynamics in a way that aligns with clinical expectations.
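The patient-day aggregation described above can be sketched as a merge keyed on patient and date. The field names here are hypothetical; the study's actual schema is described in the accompanying paper.

```python
from collections import defaultdict

def patient_day_summaries(records):
    """Merge EMR, survey, and wearable rows into one summary per
    (patient, date) key. Later rows overwrite earlier ones when a
    field appears in more than one source."""
    summaries = defaultdict(dict)
    for row in records:
        key = (row["patient_id"], row["date"])
        summaries[key].update(
            {k: v for k, v in row.items() if k not in ("patient_id", "date")}
        )
    return dict(summaries)
```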
How LLMs Were Used
Rather than operating on raw sensor streams or unstructured clinical notes, LLMs were provided with structured daily summaries derived from the assessments collected in the study. These models included Gemini 2.5 Flash, Gemini 3 Pro, Qwen3-7B Large, Claude Opus 4.5, and MedGemma. They were selected based on prior use in health data interpretation and their ability to reason over structured clinical inputs.
The table below summarizes the primary domains of information used to construct these summaries, spanning clinical records, patient-reported assessments, and wearable-derived signals. These inputs were aggregated at the patient-day level and normalized relative to individual baselines before being passed to the LLM reasoning agent within the MAS.
| Domain | Example Inputs Used for LLM Reasoning |
| --- | --- |
| Pain | Daily pain score, change from baseline |
| Sleep | Sleep duration, sleep quality |
| Activity | Step count, activity intensity |
| Stress | Self-reported stress level |
| Medication | Dose, timing, adherence |
| Clinical context | Demographics, treatment status |
Clinical, survey, and wearable-derived signals used to construct structured daily summaries provided to the LLM reasoning agent. Adapted from Table 1 in the accompanying paper.
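The "normalized relative to individual baselines" step can be sketched as expressing each day's value as a deviation from that patient's own recent history, so the LLM sees "change from baseline" rather than raw scores. The 14-day window is an illustrative choice, not the study's parameter.

```python
def normalize_to_baseline(values, window=14):
    """Convert a patient's daily series into deviations from the
    mean of the preceding `window` days. Early days with no history
    fall back to a zero deviation."""
    out = []
    for i, v in enumerate(values):
        past = values[max(0, i - window):i]
        baseline = sum(past) / len(past) if past else v
        out.append(round(v - baseline, 2))
    return out
```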
After structuring inputs this way, LLMs were prompted to reason over daily patient summaries derived from APT’s clinical, survey, and wearable data and asked to assess:
Whether a pain spike was likely relative to the patient’s baseline
How current pain compared to the patient’s historical baseline
Which contextual factors appeared most relevant to changes in pain risk
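A minimal sketch of how a daily summary might be rendered into the reasoning prompt, covering the three questions above. The wording and the JSON response schema are assumptions for illustration, not the study's exact prompt.

```python
import json

def build_daily_prompt(summary):
    """Render a structured patient-day summary into a prompt asking
    for a spike judgment, a baseline comparison, and contextual
    factors. Schema and phrasing are illustrative."""
    return (
        "You are reviewing one day of data for a chronic pain patient.\n"
        f"Structured summary:\n{json.dumps(summary, indent=2)}\n\n"
        "Answer in JSON with keys:\n"
        '  "spike_likely" (true/false, per the defined threshold),\n'
        '  "vs_baseline" (how today compares to the patient\'s history),\n'
        '  "relevant_factors" (list of contextual drivers).'
    )
```

Asking for keyed JSON rather than free text is what makes the outputs "structured and inspectable": downstream agents can check each field against the study's definitions.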
Evaluation focused on clinical coherence and grounding, not headline accuracy. Model outputs were assessed based on whether they anchored reasoning to patient-specific baselines, applied the study’s explicit definitions of pain spikes consistently, and distinguished persistent pain from discrete deviations.
What the LLM Evaluation Revealed
Comparative evaluation across multiple LLMs revealed consistent patterns in how the models reasoned about chronic pain dynamics, independent of any single model.
Across models, a common challenge was anchoring reasoning to patient-specific baselines. The Gemini models most often generated fluent summaries of pain trends and risk factors without consistently grounding those observations in the patient’s historical distribution, resulting in outputs that felt more descriptive than decisive. Qwen3-7B incorporated baseline comparisons more explicitly, and Claude produced structured, clinically interpretable assessments that engaged with deviations over time. MedGemma stood out for consistently engaging the spike-prediction task directly, producing explicit predictive statements and framing conclusions relative to patient history rather than relying primarily on narrative description.
A second recurring issue was inconsistent application of operational definitions. The Gemini models frequently discussed increases in pain without clearly determining whether the defined threshold had been crossed. Qwen3-7B and Claude more often incorporated threshold logic into their reasoning, though not always explicitly. MedGemma most consistently generated direct spike determinations tied to the defined criteria, distinguishing sustained pain from threshold-defined deviations and making its outputs more readily interpretable within a rule-based clinical framework.
Together, these findings show that surface-level coherence does not guarantee alignment with clinical reasoning requirements, particularly for longitudinal conditions like chronic pain.
Differences Between General and Healthcare-Tuned LLMs
While no model was uniformly correct, differences emerged when comparing general-purpose LLMs with healthcare-tuned models. Healthcare-specific models more frequently attempted to:
Reference patient-specific baselines explicitly
Reason over trajectories rather than single-day values
Frame changes in pain in relation to clinically defined thresholds
These behaviors did not eliminate errors, but they made outputs more directly evaluable within the MAS, where upstream agents enforced definitions and baselines.
MedGemma exemplified this pattern, consistent with its demonstrated performance in other structured clinical workflows. After calibrating the system to better route clinical cases to the appropriate specialty model, MedGemma made more accurate routing decisions on new, unseen data. With additional specialty-level fine-tuning, it performed comparably to models designed specifically for those individual clinical tasks. In oncology-focused feature extraction, MedGemma also performed more reliably than general-purpose LLMs in identifying and extracting the tumor-specific features required for board review. While this does not eliminate errors, it illustrates how healthcare-tuned models can better align with definition-driven clinical workflows when embedded within a constrained system.
Implications for System Design
A central lesson from this work is that the usefulness of LLMs depends less on any single model and more on how models are embedded within a broader system.
Within a multi-agent architecture, LLMs can add value by synthesizing context and generating structured reasoning, but only when paired with agents responsible for baseline construction, definition enforcement, and downstream validation. In Nimblemind’s MAS, these responsibilities are distributed across modular agents that handle ingestion, model matching, and rule enforcement, allowing the LLM to focus on structured clinical reasoning within a controlled framework.
The APT Foundation setting illustrates why this system-level perspective is essential in real clinical environments.
How This Fits Into Nimblemind’s Approach
LLMs have the potential to aid in the management of chronic pain and opioid use disorder by supporting contextual, patient-specific reasoning over longitudinal data. Their effectiveness, however, depends on careful evaluation and thoughtful integration within a MAS.
The collaboration with the APT Foundation illustrates this approach in practice: selecting models based on clinical fit, testing them against real data, and integrating them within an auditable MAS.
As healthcare-specific language models continue to evolve, tools like MedGemma represent promising building blocks, but not standalone solutions. Careful model selection and evaluation will continue to guide how we build and apply clinical AI at Nimblemind.
