In a recent study published in the journal The Lancet Digital Health, scientists in the United States assessed the effectiveness and challenges of artificial intelligence (AI) in clinical practice by analyzing randomized controlled trials, highlighting the need for more diverse and integrated research approaches.
Review: Randomized controlled trials evaluating artificial intelligence in clinical practice: a scoping review. Image credit: Kundra / Shutterstock
Background
The role of artificial intelligence in healthcare has expanded significantly over the past five years, showing potential to match or exceed the performance of clinicians in a variety of specialties. However, most AI models have been validated retrospectively rather than tested in real-world settings. Of the nearly 300 AI-enabled medical devices approved by the United States (US) Food and Drug Administration (FDA), only a few have been evaluated through prospective randomized controlled trials (RCTs). This gap in real-world testing raises concerns about the reliability and effectiveness of AI, including alert fatigue caused by inaccurate AI predictions and decay in model performance after deployment. Further research is needed to validate the effectiveness of AI in the real world, address biases, and ensure its safe, fair, and effective integration into clinical practice.
About the study
Databases including SCOPUS, PubMed, CENTRAL, and the International Clinical Trials Registry Platform were systematically searched for records published between January 1, 2018, and November 14, 2023, a window spanning the rise of modern artificial intelligence in clinical trials. Search terms included "artificial intelligence", "clinician", and "clinical trial", with further studies identified through manual review of the references of relevant publications.
Inclusion was limited to RCTs in which a substantial component of the intervention was artificial intelligence, defined as a non-linear computational model such as a decision tree or neural network, integrated into clinical practice with the potential to affect patient management. Exclusions included studies using only linear models, secondary analyses, abstracts, and incomplete interventions. The methodology followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) extension for scoping reviews and was registered in the International Prospective Register of Systematic Reviews (PROSPERO).
Publications were initially screened on titles and abstracts using the Covidence review software. Two independent investigators performed the screening, followed by full-text reviews. Data extraction was completed in Google Sheets by one researcher and verified by another, with any disagreements resolved by a third. Information was collected on study location, participant characteristics, clinical tasks, primary endpoints, procedure times, comparators, outcomes, AI type, and origin. Studies were categorized by primary endpoint group, clinical area or specialty, and AI data type.
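The dual-reviewer workflow described above can be sketched as a simple adjudication rule: two independent reviewers vote on each record, and a third reviewer resolves any disagreement. This is an illustrative sketch only; the function name and example decisions are hypothetical, not taken from the review.

```python
# Hypothetical sketch of the dual-reviewer screening logic: two independent
# reviewers vote on each record, and a third breaks ties. Names are illustrative.

def adjudicate(reviewer_a: bool, reviewer_b: bool, reviewer_c: bool) -> bool:
    """Return the inclusion decision for one record."""
    if reviewer_a == reviewer_b:
        return reviewer_a          # independent reviewers agree
    return reviewer_c              # third reviewer resolves the disagreement

decisions = [
    adjudicate(True, True, False),   # agreement -> included
    adjudicate(True, False, True),   # disagreement -> third reviewer includes
    adjudicate(False, False, True),  # agreement -> excluded
]
print(decisions)  # [True, True, False]
```

The same tie-breaking pattern applies to the data-extraction step, where a second researcher verified entries and a third resolved conflicts.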
Study authors were not contacted for additional information, and due to the varied nature of tasks and endpoints between studies, no meta-analyses were performed. Instead, descriptive statistics were used to provide an overview of the characteristics of the trials included in this review.
Study results
The search retrieved 6,219 database records and 4,299 trial-registry records; after removing duplicates, 10,484 unique records published between January 1, 2018, and November 14, 2023, remained. Initial screening of titles and abstracts narrowed the selection to 133 articles that underwent full-text review. Subsequent exclusions left 73 studies, which were supplemented by 13 additional articles identified through reference checking, for a total of 86 unique RCTs.
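The screening-flow counts above fit together arithmetically; the short check below uses only the figures reported in the review (the duplicate count is derived from them, not stated directly).

```python
# Screening-flow arithmetic from the counts reported above.
retrieved = 6219 + 4299          # database records + trial-registry records
unique = 10484                   # unique records after duplicate removal
duplicates = retrieved - unique  # implied number of duplicates

full_text = 133                  # kept after title/abstract screening
included_after_review = 73       # remaining after full-text exclusions
from_references = 13             # added via manual reference checking
total_rcts = included_after_review + from_references

print(duplicates, total_rcts)    # 34 86
```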
Of these 86 RCTs, a significant proportion (43%) focused on gastroenterology, followed by radiology (13%), surgery (6%), and cardiology (6%). Gastroenterology trials primarily used video-based deep learning algorithms to assist clinicians, mainly in evaluating diagnostic performance. Most gastroenterology trials were concentrated in four research groups, highlighting a lack of diversity in trial conduct. Geographically, 92% of trials were conducted within single countries, with the US and China leading in the number of trials but focusing on different specialties.
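Applied to the 86 included RCTs, the reported specialty percentages imply roughly the trial counts computed below; the counts are rounded approximations derived from the stated figures, not numbers reported directly by the review.

```python
# Approximate trial counts implied by the reported specialty percentages.
total_trials = 86
shares = {
    "gastroenterology": 0.43,
    "radiology": 0.13,
    "surgery": 0.06,
    "cardiology": 0.06,
}

counts = {spec: round(total_trials * p) for spec, p in shares.items()}
print(counts)
# {'gastroenterology': 37, 'radiology': 11, 'surgery': 5, 'cardiology': 5}
```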
Trials typically involved single centers and averaged 359 participants. Participant demographics such as age and gender were consistently reported, but race or ethnicity was included less frequently.
Diagnostic performance was the most common primary endpoint, followed by metrics related to care management, patient behavior and symptoms, and clinical decision making. Specifically, AI interventions in insulin dosing and hypotension monitoring improved clinical management by increasing the time patients spent within target ranges. Other AI applications positively affected patient behavior, as seen in trials that increased adherence to referral recommendations through direct AI-generated predictions.
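A "time within target range" endpoint of the kind used in the insulin-dosing and hypotension trials can be computed as the fraction of evenly spaced readings falling inside a target band. The sketch below is illustrative: the readings and the 70-180 mg/dL glucose band are invented for the example, not taken from any trial in the review.

```python
# Minimal sketch of a time-in-range endpoint: the fraction of evenly spaced
# readings that fall inside a clinical target band. Values are hypothetical.

def time_in_range(readings, low, high):
    """Fraction of readings falling inside [low, high]."""
    in_range = sum(1 for r in readings if low <= r <= high)
    return in_range / len(readings)

glucose = [95, 110, 160, 185, 120, 105, 90, 200]  # mg/dL, hypothetical
print(time_in_range(glucose, 70, 180))  # 0.75
```

An AI dosing intervention would be judged favorable if it raised this fraction relative to usual care.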
The majority of trials evaluated deep learning systems for medical imaging, especially video-based systems used in endoscopy. The use of AI varied across data types, including structured data from electronic health records and waveform data. In terms of development, most AI models came from industry, with academia also playing an important role.
Analyses of the results revealed that many trials achieved statistically significant improvements in their primary endpoints when AI assisted clinicians, compared with usual care. However, a smaller group of trials used non-inferiority designs to demonstrate that AI systems could match the performance of unassisted clinicians or usual care.
Procedure-time measurements varied between trials, with some reporting significant decreases while others saw increases or no change. Gastroenterology was the most studied specialty with respect to procedure-time effects, with mixed results regarding the impact of AI on procedural efficiency.