Evaluating a vision-language model within a structured workflow for clinical model selection and specialty-level deployment
Mar 20, 2026

Clinical AI workflows today are often fragmented, with fewer than 10% of published clinical AI models ever reaching real-world deployment despite strong performance in research settings. This gap is driven by integration, validation, and monitoring requirements that extend beyond model development. Triage, task selection, and model deployment are typically handled by separate systems, requiring data scientists to manually identify the right model for each dataset and maintain multiple task-specific pipelines. This separation increases the risk of mismatched model selection, duplicated engineering effort, and inconsistent performance across use cases. It also introduces ongoing maintenance overhead, as each additional model must be independently validated and monitored. As a result, even high-performing models, such as FDA-cleared imaging algorithms for detecting diabetic retinopathy or lung nodules, can face delays in reaching real clinical settings.
In practice, selecting the right model is not always straightforward. For clinical imaging tasks, key attributes such as modality (e.g., CT vs. MRI) and clinical indication are embedded in raw image data rather than explicitly labeled. Extracting this information requires additional preprocessing and domain knowledge, and mistakes at this stage can lead to suboptimal model choices. At the same time, health systems must validate, deploy, and monitor models individually, creating overhead that grows with each new use case.
These challenges highlight a gap between model development and real-world deployment. The question is no longer just whether a model performs well on a benchmark; it is whether the model can be integrated into a workflow that is reliable, scalable, and maintainable over time. That means, for example, integrating with electronic health record systems, supporting multiple imaging modalities across sites, and maintaining performance despite shifts in data quality and patient populations.
As part of Nimblemind’s ongoing work in clinical AI infrastructure, we evaluated whether a single vision-language model (VLM), MedGemma, could help bridge this gap by supporting both model selection and downstream clinical tasks within a unified system. This evaluation was conducted within the context of Nimblemind’s multi-agent system (nMAS), which coordinates data interpretation, model selection, and task execution across clinical workflows.
Why Clinical AI Workflows Need to Evolve
Despite advances in machine learning, most clinical AI models never reach production. This gap is driven by challenges beyond model performance, including the need for clinical and regulatory validation, integration with existing hospital infrastructure, and ongoing monitoring to ensure safety and reliability in real-world settings. Studies show that external validation is performed in fewer than 30% of published models and that performance often drops by 10-30% when applied to new clinical settings.
Two structural challenges consistently emerge:
Model selection is manual and opaque – Data scientists must infer key dataset characteristics, such as modality or pathology, before choosing an appropriate model.
Deployment is fragmented – Each task-specific model must be independently validated, integrated, and maintained.
These challenges are not due to a lack of models, but rather a lack of systems that can determine when and how to use them effectively. This creates an opportunity to rethink how clinical AI workflows are designed. Rather than introducing more models, the focus shifts to building coordinating systems, such as the nMAS, that can orchestrate model selection, reasoning, and deployment within structured clinical workflows.
Why MedGemma for Clinical Workflows
VLMs are designed to jointly process images and text, making them well-suited for clinical settings where both visual data and structured reasoning are required. Unlike traditional image models that produce a single prediction, they can:
Interpret inputs across modalities
Follow structured instructions
Generate structured, inspectable outputs
MedGemma builds on this capability with pretraining across a wide range of clinical imaging domains, including radiology, pathology, ophthalmology, and dermatology. This broad exposure allows the model to recognize different modalities and disease patterns while remaining flexible enough to support multiple downstream tasks.
This flexibility is particularly important in clinical workflows, where the same input may require different types of reasoning depending on context. For example, an image may need to be categorized by modality, assessed for abnormalities, and routed to the appropriate model or task. Traditional pipelines handle these steps with separate components. This work explores whether a single model can instead reason across these steps in a structured way.
This leads to the central question: Can a VLM serve both as a model selector and as a deployable clinical model within real-world workflows?
A New Approach: One VLM, Two Roles
To answer this question, we evaluated MedGemma in two distinct but related roles:
Model selection (routing) – Determining which model or task is appropriate for a given input
Model execution (deployment) – Performing the selected clinical task within a specialty
These roles were evaluated independently, but together they represented two critical steps in clinical AI workflows:
Deciding what to do
Then doing it
By using the same model for both roles, we were able to improve routing accuracy by 10% and consolidate multiple task-specific models into a single specialty-level model per domain within the nMAS, reducing system complexity while maintaining performance.
Role 1: Stage-Wise Model Selection
The first role focused on improving how models were selected for a given clinical input. In many workflows, this step was handled manually, requiring engineers to inspect datasets and match them to appropriate models. This process was time-consuming and often inconsistent. To address this, we implemented a structured, three-stage workflow in which MedGemma acted as a context-aware model selector within the nMAS.
Stage 1: Modality Identification. MedGemma determined the type of input it was observing (e.g., CT, MRI, histopathology). If the input did not match known modalities, the model could abstain, reducing incorrect routing.
Stage 2: Primary Abnormality Detection. MedGemma identified the most relevant clinical finding visible in the image, constrained to observable features to reduce overinterpretation.
Stage 3: Model-Card Matching. MedGemma selected the most appropriate model from a repository of model cards, matching both modality and clinical context, or abstaining when no suitable option existed.
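The three stages above can be sketched as a small routing function. Everything here is illustrative: the model cards, abstention token, and function names are placeholders, and the stage 1 and stage 2 VLM calls are passed in as callables rather than real MedGemma invocations.

```python
from dataclasses import dataclass

ABSTAIN = "abstain"  # returned when the router cannot make a confident choice

@dataclass
class ModelCard:
    """Minimal stand-in for an entry in the model-card repository."""
    name: str
    modality: str
    indication: str

# Hypothetical repository of deployable task-specific models.
MODEL_CARDS = [
    ModelCard("lung-nodule-ct", "CT", "lung nodule"),
    ModelCard("dr-fundus", "fundus photography", "diabetic retinopathy"),
]

def match_model_card(modality: str, finding: str, cards: list) -> str:
    """Stage 3: pick a card matching both modality and clinical context,
    or abstain when no suitable option exists."""
    for card in cards:
        if card.modality == modality and card.indication == finding:
            return card.name
    return ABSTAIN

def route(image, identify_modality, detect_abnormality, cards) -> str:
    """Run the staged selection workflow for one input image."""
    # Stage 1: modality identification (a prompted VLM call in practice).
    modality = identify_modality(image)
    if modality == ABSTAIN:
        return ABSTAIN  # unknown modality: refuse rather than misroute
    # Stage 2: primary abnormality detection, constrained to observable features.
    finding = detect_abnormality(image, modality)
    # Stage 3: model-card matching.
    return match_model_card(modality, finding, cards)
```

In the real system, `identify_modality` and `detect_abnormality` would wrap prompted MedGemma calls; here they can be stubbed, e.g. `route(img, lambda i: "CT", lambda i, m: "lung nodule", MODEL_CARDS)` returns `"lung-nodule-ct"`.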
This transformed model selection into a structured reasoning process rather than a manual decision. Even with structured prompts, outputs could vary. To improve reliability, we introduced an answer-selection mechanism that:
Considered both the top prediction and the second-most likely alternative
Selected the second option if it met a confidence threshold
This approach improved routing accuracy by approximately 10% and improved calibration, resulting in a system that was more accurate, transparent, and auditable.
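One plausible reading of the top-2 selection rule is sketched below; the exact decision logic and threshold are assumptions, not specified in this work.

```python
def select_answer(candidates, threshold=0.3):
    """Top-2 answer selection over (label, confidence) pairs.

    Assumption: the runner-up replaces the top prediction when its own
    confidence clears the threshold; otherwise the top prediction stands.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if len(ranked) < 2:
        return ranked[0][0]  # nothing to fall back to
    top, second = ranked[0], ranked[1]
    if second[1] >= threshold:
        return second[0]
    return top[0]
```

For example, `select_answer([("CT", 0.9), ("MRI", 0.1)])` keeps the confident top prediction `"CT"`, while a close second above the threshold would be preferred instead.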
Role 2: Specialty-Level Deployment
The second role focused on reducing the complexity of deploying clinical AI models. In many healthcare systems, each task requires a separate model, and even within a single specialty, multiple models are used for different tasks. Each must be validated, integrated, and maintained independently, creating significant overhead.
To address this, we evaluated whether MedGemma could be fine-tuned at the specialty level to support multiple tasks within a single model. Instead of training one model per dataset, tasks were grouped by clinical specialty, and MedGemma was adapted to handle multiple tasks within that domain. This shifted effort from building new models to adapting a shared model.
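The grouping step can be illustrated with a short sketch: task datasets are bucketed by specialty, and each bucket becomes the training mix for one shared model, with the task identified in the prompt. Task names and the example format are hypothetical.

```python
from collections import defaultdict

# Hypothetical task registry: (task_name, clinical_specialty) pairs.
TASKS = [
    ("diabetic-retinopathy-grading", "ophthalmology"),
    ("glaucoma-detection", "ophthalmology"),
    ("lymph-node-metastasis", "pathology"),
    ("tissue-subtype-classification", "pathology"),
]

def group_by_specialty(tasks):
    """Group task datasets by specialty so one model is fine-tuned per
    specialty rather than one per dataset."""
    groups = defaultdict(list)
    for task, specialty in tasks:
        groups[specialty].append(task)
    return dict(groups)

def to_training_example(task, image_path, label):
    """Format one multi-task instruction example: the shared specialty
    model is conditioned on the task via the prompt."""
    return {
        "image": image_path,
        "prompt": f"Task: {task}. Report the relevant finding.",
        "target": label,
    }
```

Under this scheme the four tasks above would yield two fine-tuning jobs (one ophthalmology model, one pathology model) instead of four separate models.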
Across datasets and specialties, we observed consistent patterns:
The base (zero-shot) model performed modestly
Specialty-specific fine-tuning improved performance significantly
Fine-tuned models approached task-specific benchmarks
For example, in pathology and ophthalmology tasks, fine-tuned MedGemma achieved performance comparable to specialized models while maintaining a unified deployment framework.
By reducing the number of models required, this approach simplified:
Validation and regulatory review
Integration into clinical systems
Monitoring and maintenance
This made it easier to scale clinical AI without increasing system complexity.
What This Means for Clinical AI Systems
A key insight from this work was that clinical AI performance depends as much on workflow design as on model accuracy. MedGemma was effective not because it replaced all components, but because it operated within a structured system:
In model selection, staged reasoning enabled consistent decision-making.
In deployment, specialty-level adaptation enabled reuse across tasks.
This reflected a broader shift in clinical AI:
From isolated models → to integrated systems
From single predictions → to structured decision processes
From task-specific pipelines → to reusable model components
Within nMAS, the unified approach enables models to be selected, adapted, and deployed within a coordinated system, bridging the gap between model development and real-world use.
How This Fits Into Nimblemind’s Approach
At Nimblemind, we focus on building systems that can operate over multimodal clinical data and integrate AI into real-world workflows. This work reflects that approach by evaluating MedGemma as part of the nMAS, which coordinates structured prompting, model selection, and deployment across clinical tasks.
Rather than optimizing for a single benchmark, the goal was to create systems that were:
Transparent – decisions can be inspected and audited
Scalable – new tasks can be added without redesigning the system
Maintainable – fewer models reduce operational complexity
By combining model selection and deployment within nMAS, we moved closer to clinical AI systems that are both effective and practical in real-world settings.