standards, assessment

Clinical AI Standards Working Group

With MLCommons, we are developing the standards, real-world clinical tasks, data and evaluation metrics that allow clinical, patient-facing AI systems to be assessed for a variety of deployment scenarios.

research, assessment

PROACT: Predict Real-world Outcomes via Automated Comprehensive Testing

Goal: Reliable and cost-effective assessment that matches real-world deployment outcomes. What if we could assess AI capabilities with deployment-like realism, reliability, and reproducibility at very low cost?

Clinical AI Standards Working Group

We are bringing together experts from clinical care and AI to develop 1) an open and standard approach to assessing clinical AI systems and 2) a process for certifying assessors and their implementations of the standard. This includes the methods, the data and a reference software platform for conducting evaluations.

Definition of clinical tasks and scoring

Define the clinical use cases, deployment scenarios, and performance requirements that evaluation standards must cover. Engage clinicians, health systems, and regulators to align on what "safe and effective" means in practice.

Data collection and annotation

Identify, curate and annotate representative datasets that reflect the diversity of patient populations and care settings. Establish data governance and privacy-preserving sharing agreements.

Development of the prototype platform

Create a reference platform that allows models and agents to be tested using real clinical data but under strict privacy protections.

Evaluation of testing protocol and platform

Conduct evaluations using proposed protocols, scoring and the reference platform and compare with real-world outcomes. Use real-world outcomes to refine the standards.

Standards release and stewardship

Publish standards for evaluation and certification for broad community use. Support adoption by AI developers, health systems, assessors and regulators as a shared, reproducible evaluation baseline.

PROACT: Predict Real-world Outcomes via Automated Comprehensive Testing

Today's AI assessment approaches are costly, hard to reproduce, and don't match deployment outcomes.

Our current approaches to assessment are based on community testing methodologies from data science and cybersecurity.

Benchmarking

We often employ shared (common) task benchmarks [Donoho 2017], developed in the AI/ML research community over more than 50 years, which test the ability of AI models to perform standardized tasks on common test sets that are not used during training and development.

✗ High non-recurring engineering (NRE) cost and build time required to collect and label ground truth on representative tasks with enough trials to be statistically valid.

✗ Constant, costly refresh needed: Systems learn from testing; test reliability degrades over time.

✗ Reliability limited by task representativeness and data freshness.
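To give a rough sense of what "enough trials to be statistically valid" can mean, the sketch below estimates how many independent test trials are needed to measure a pass rate within a chosen margin of error. It uses the standard normal-approximation formula for a proportion; the function name `trials_needed` and its defaults are illustrative assumptions, not part of any benchmark described here.

```python
import math

def trials_needed(margin, expected_rate=0.5, z=1.96):
    """Approximate trials needed so a measured pass rate lies within
    +/- margin of the true rate at about 95% confidence (z = 1.96).

    Normal approximation for a proportion. expected_rate = 0.5 is the
    most conservative choice because it maximizes the variance.
    """
    variance = expected_rate * (1 - expected_rate)
    return math.ceil(z ** 2 * variance / margin ** 2)

# A +/- 5 point margin on a pass rate needs a few hundred labeled trials.
print(trials_needed(0.05))  # 385
```

Because the margin enters as a square, halving it roughly quadruples the number of labeled trials, which is one reason collecting ground truth is a dominant cost.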

Red Teaming

To assess the weaknesses and risks of these models, we often use expert red teams who, guided by their expertise, probe AI models in ways designed to elicit errors and errant behaviors.

✗ High cost: Expert human testers are required.

✗ Scale limited by domain knowledge and expertise required to actively discover vulnerabilities.

✗ Limited reproducibility and reliability: Results depend on tester availability and vary from tester to tester.

Piloting

Test in deployment: pay the up-front cost to install, integrate, and deploy a specific system from a specific vendor, train users, and assess outcomes.

✓ Representative: Nothing is more realistic than testing in use, if vendors and buyers are patient enough to endure initial deployment pains.

✗ Cost for both buyers and sellers: IT integration with user/vendor systems, user training, support, etc.

✗ Low reproducibility/reliability: Most pilots are too small and too uncontrolled to be indicative of long-term outcomes.

The result: AI buyers and vendors engage in costly pilots that fail despite promising pre-deployment benchmarks.

Our Approach

User simulator

Specialized AI models of users, built from an extensive corpus of real user profiles and data histories. These models are capable of generating interactions with AI systems in the form of dialog and data.

Low cost, high speed: Full automation eliminates manual testing overhead.

Data and privacy preserving: Models are trained to *simulate* users without leaking sensitive information.

AutoPen

Specialized AI models that explicitly probe for error modes and vulnerabilities.

Explicit testing for tail risks: AI models derived from real vulnerabilities and exploits actively probe models under test.

Empirical estimation of deployment outcomes

The ideal assessment is one that predicts outcomes of deployments. We build empirical models that learn the relationship between automated testing/assessment and real deployment outcomes.

Extensive and realistic task evaluation during assessment: Automated testing driven by patient simulators trained on real data exercises systems extensively and realistically.

High reliability: Aggregation of historical test data and deployment outcomes enables statistically reliable estimation of risk.
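As a minimal illustration of learning the relationship between automated test results and deployment outcomes, the sketch below fits an ordinary least-squares line to paired historical data and uses it to estimate the outcome for a new system. The linear form, the function names, and the numbers are all assumptions made for illustration; the text above does not specify what kind of empirical model is used, and the data here are invented placeholders rather than real results.

```python
def fit_line(scores, outcomes):
    """Ordinary least-squares fit of outcome = slope * score + intercept."""
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(outcomes) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, outcomes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, score):
    """Estimate a deployment outcome from an automated test score."""
    slope, intercept = model
    return slope * score + intercept

# Hypothetical history: automated test score and observed deployment
# success rate for six past systems (placeholder values, not real data).
history_scores = [0.62, 0.70, 0.75, 0.81, 0.88, 0.93]
history_outcomes = [0.48, 0.55, 0.63, 0.66, 0.79, 0.82]

model = fit_line(history_scores, history_outcomes)
print(round(predict(model, 0.85), 3))  # estimated outcome for a new system
```

With more historical pairs, the same idea extends to richer models and to uncertainty estimates around each prediction.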

If we are successful...

AI systems can be reliably and regularly tested at low cost prior to deployment, and then deployed with predictable outcomes.

Learn more about our work.