Clinical AI Standards Working Group
With MLCommons, we are developing the standards, real-world clinical tasks, data and evaluation metrics that allow clinical, patient-facing AI systems to be assessed for a variety of deployment scenarios.
Goal: Reliable and cost-effective assessment that matches real-world deployment outcomes. What if we could assess AI capabilities with deployment-like realism, reliability and reproducibility, at very low cost?
We are bringing together experts from clinical care and AI to develop 1) an open and standard approach to assessing clinical AI systems and 2) a process for certifying assessors and their implementations of the standard. This includes the methods, the data and a reference software platform for conducting evaluations.
Define the clinical use cases, deployment scenarios, and performance requirements that evaluation standards must cover. Engage clinicians, health systems, and regulators to align on what "safe and effective" means in practice.
Identify, curate and annotate representative datasets that reflect the diversity of patient populations and care settings. Establish data governance and privacy-preserving sharing agreements.
Create a reference platform that allows models and agents to be tested using real clinical data but under strict privacy protections.
Conduct evaluations using the proposed protocols, scoring and reference platform, and compare the results with real-world outcomes. Use those outcomes to refine the standards.
Publish standards for evaluation and certification for broad community use. Support adoption by AI developers, health systems, assessors and regulators as a shared, reproducible evaluation baseline.
Our current approaches to assessment are based on community testing methodologies from data science and cybersecurity.
We often employ shared (common) task benchmarks [Donoho 2017], developed in the AI/ML research community over more than 50 years, which test the ability of AI models to perform standardized tasks on common test sets that are not used during training or development.
✗ High non-recurring engineering (NRE) cost and build time to collect and label ground truth on representative tasks, with enough trials to be statistically valid.
✗ Constant, costly refresh needed: Systems learn from testing; test reliability degrades over time.
✗ Reliability limited by task representativeness and data freshness.
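To give a rough sense of what "enough trials to be statistically valid" can mean, here is a minimal sketch. It assumes each trial is an independent pass/fail outcome and uses the standard normal-approximation sample-size formula for a proportion. The function name `trials_needed` and the margins chosen are illustrative assumptions, not part of any working group protocol.

```python
import math
from statistics import NormalDist


def trials_needed(margin: float, confidence: float = 0.95, p: float = 0.5) -> int:
    """Trials needed so a pass-rate estimate lies within +/- margin.

    Assumes independent pass/fail trials and the normal approximation
    n = z^2 * p * (1 - p) / margin^2. Using p = 0.5 is the worst case,
    so the result is a conservative (upper) estimate.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


print(trials_needed(0.05))  # 385 labeled trials for +/- 5 points
print(trials_needed(0.01))  # 9604 labeled trials for +/- 1 point
```

Tightening the margin from 5 points to 1 point multiplies the labeling effort by about 25, which is one way to see why ground-truth collection dominates benchmark cost.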
To assess the weaknesses and risks of these models, we often use expert red teams who, guided by their expertise, probe AI models in ways designed to elicit errors and errant behaviors.
✗ High cost: expert human testers required.
✗ Scale limited by the domain knowledge and expertise required to actively discover vulnerabilities.
✗ Limited reproducibility and reliability due to tester availability and variability.
Test in deployment. Pay the up-front cost to install, integrate and deploy a specific system from a specific vendor, train users, assess outcomes.
✓ Representative: Nothing is more realistic than testing in use, if vendors and buyers are patient enough to endure the initial pains of a pilot deployment.
✗ Cost for both buyers and sellers: IT integration with user/vendor systems, user training, support, etc.
✗ Low reproducibility/reliability: Most pilots are too small and too uncontrolled to be indicative of long-term outcomes.
The result: AI buyers and vendors engage in costly pilots that fail despite promising pre-deployment benchmarks.
Specialized AI models of users, built from an extensive corpus of real user profiles and data histories. These models are capable of generating interactions with AI systems in the form of dialog and data.
Specialized AI models that explicitly probe for error modes and vulnerabilities.
The ideal assessment is one that predicts outcomes of deployments. We build empirical models that learn the relationship between automated testing/assessment and real deployment outcomes.
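As a minimal sketch of what such an empirical model could look like, the example below fits an ordinary least-squares line from an automated assessment score to an observed deployment outcome. The choice of a linear model, the function name `fit_line`, and all numbers are illustrative assumptions; the data is made up for the example and does not come from any evaluation.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical, made-up pairs: (automated score, deployment outcome rate).
automated_scores = [0.60, 0.70, 0.80, 0.90]
deployment_outcomes = [0.50, 0.58, 0.66, 0.74]

slope, intercept = fit_line(automated_scores, deployment_outcomes)


def predict(score: float) -> float:
    """Predicted deployment outcome for a new automated score."""
    return slope * score + intercept


print(f"predicted outcome at score 0.85: {predict(0.85):.2f}")
```

In practice the relationship would likely need more features, many more deployments, and held-out validation before its predictions could be trusted; the point here is only the shape of the idea: calibrate cheap automated scores against the outcomes they are meant to predict.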
If we're successful, AI systems can be tested reliably and regularly at low cost prior to deployment, and deployed with predictable outcomes.
Learn more about our work.
Contact Us