In a recent study published in Prostate Cancer and Prostatic Diseases, a group of researchers evaluated the accuracy and quality of Chat Generative Pre-trained Transformer (ChatGPT) responses to questions about male lower urinary tract symptoms (LUTS) suggestive of benign prostatic enlargement (BPE), comparing them with established urological references.
Study: Can ChatGPT provide high-quality patient information about male lower urinary tract symptoms suggestive of benign prostatic hyperplasia?
Background
As patients increasingly seek online medical guidance, major urological associations such as the European Association of Urology (EAU) and the American Urological Association (AUA) provide high-quality resources. However, modern technologies such as artificial intelligence (AI) are gaining popularity because of their effectiveness.
ChatGPT, with more than 1.5 billion monthly visits, offers a user-friendly chat interface. A recent survey showed that 20% of urologists used ChatGPT clinically, with 56% recognizing its potential to support clinical decision-making.
Studies on the urological accuracy of ChatGPT show mixed results. Further research is needed to comprehensively evaluate the effectiveness and reliability of AI tools such as ChatGPT in providing accurate and high-quality medical information.
About the study
The present study reviewed the EAU and AUA patient information websites to identify key topics related to BPE and formulated 88 relevant questions.
These questions covered definitions, symptoms, diagnostics, risks, management, and treatment options. Each question was submitted independently to ChatGPT, and the responses were recorded for comparison with the reference material.
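For readers curious how such a question battery could be run programmatically, the sketch below submits each question through the OpenAI Python client and stores the replies. This is an illustration only: the study queried ChatGPT directly, and the model name, example questions, and client setup here are assumptions.

```python
# Illustration only: the study submitted its questions to ChatGPT directly;
# this sketch shows how a similar batch could be automated with the OpenAI
# Python client. The model name and example questions are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

questions = [
    "What is benign prostatic enlargement (BPE)?",
    "What are the treatment options for BPE?",
    # ... remaining questions derived from the EAU and AUA patient resources
]

responses = {}
for question in questions:
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model, for illustration only
        messages=[{"role": "user", "content": question}],
    )
    responses[question] = completion.choices[0].message.content
```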
Two examiners classified ChatGPT responses as true negative (TN), false negative (FN), true positive (TP), or false positive (FP). Discrepancies were resolved by consensus or consultation with a senior expert.
Performance metrics, including the F1 score, precision, and recall, were calculated to assess accuracy, with the F1 score used because it balances precision and recall in a single measure of model performance.
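These metrics are standard. A minimal sketch, using assumed counts of correct statements (TP), incorrect or unsupported statements (FP), and omissions (FN), shows how precision, recall, and the F1 score are derived:

```python
# Minimal sketch of the reported metrics. The TP, FP, and FN counts here are
# hypothetical; in the study they came from comparing each ChatGPT response
# against the EAU/AUA reference material.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Assumed counts: 10 statements matching the references, 5 extra or incorrect
# statements, no omissions -> precision 0.67, recall 1.00, F1 0.80.
p, r, f1 = precision_recall_f1(tp=10, fp=5, fn=0)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```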
Global quality scores (GQS) were assigned using a 5-point Likert scale assessing the veracity, relevance, structure, and language of ChatGPT responses. Scores ranged from 1 (false or misleading) to 5 (extremely accurate and relevant). The average of the two examiners' GQS was used as the final score for each question.
Inter-examiner agreement on GQS scores was measured using the intra-class correlation coefficient (ICC), and differences were assessed with the Wilcoxon signed-rank test, with a p value of less than 0.05 considered significant. Analyses were conducted using SAS version 9.4.
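A rough sketch of this agreement analysis is shown below, using placeholder scores for the two examiners. The Wilcoxon test comes from scipy, and the ICC call relies on the pingouin package, which is an assumption on my part; the study itself performed its analyses in SAS 9.4.

```python
# Sketch of the agreement analysis, with placeholder scores for the two
# examiners (the real data set has 88 paired GQS values). scipy provides the
# Wilcoxon signed-rank test; the ICC call uses the pingouin package, which is
# an assumption -- the study performed its analyses in SAS 9.4.
import pandas as pd
import pingouin as pg
from scipy.stats import wilcoxon

examiner_a = [4, 5, 3, 4, 5, 4, 3, 5]  # placeholder GQS values
examiner_b = [4, 4, 3, 5, 5, 3, 4, 5]

# Paired comparison of the two examiners' scores
stat, p_value = wilcoxon(examiner_a, examiner_b)

# The ICC routine expects a long-format table: one row per (question, rater) pair
scores = pd.DataFrame({
    "question": list(range(len(examiner_a))) * 2,
    "rater": ["A"] * len(examiner_a) + ["B"] * len(examiner_b),
    "score": examiner_a + examiner_b,
})
icc = pg.intraclass_corr(data=scores, targets="question", raters="rater", ratings="score")

print(f"Wilcoxon p-value: {p_value:.2f}")
print(icc[["Type", "ICC"]])
```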
Study results
ChatGPT answered 88 questions in eight categories related to BPE. Specifically, 71.6% of questions (63 of 88) focused on the management of BPE, including conventional surgical interventions (27 questions), minimally invasive surgical therapies (MIST, 21 questions), and pharmacotherapy (15 questions).
ChatGPT generated responses to all 88 questions, totaling 22,946 words and 1,430 sentences. In contrast, the EAU website contained 4,914 words and 200 sentences, while the AUA patient guide had 3,472 words and 238 sentences. The AI-generated responses were almost three times longer than the source materials combined.
Performance metrics of ChatGPT responses varied, with F1 scores ranging from 0.67 to 1.0, precision scores from 0.5 to 1.0, and recall from 0.9 to 1.0.
The GQS ranged from 3.5 to 5. Overall, ChatGPT achieved an F1 score of 0.79, a precision score of 0.66, and a recall score of 0.97. GQS scores from both examiners had a median of 4, with a range of 1 to 5.
The examiners found no statistically significant difference between the quality scores they assigned to the responses (p = 0.72), and agreement between them was good, reflected by an ICC of 0.86.
Conclusions
In summary, ChatGPT addressed all 88 queries, with performance metrics consistently above 0.5 and an overall GQS of 4, indicating high-quality responses. However, ChatGPT's responses were often excessively long.
Accuracy varied by topic: ChatGPT excelled on general BPE concepts but performed less well on minimally invasive surgical therapies. The high level of inter-examiner agreement on response quality underlines the reliability of the assessment process.
As AI continues to evolve, it holds promise for enhancing patient education and support, but continued evaluation and improvement are necessary to maximize its utility in clinical settings.