Introduction. Generative AI is revolutionizing patient education by simplifying complex medical concepts into personalized content. In the healthcare sector, AI-powered systems such as chatbots are increasingly used to provide customized, immediate patient education (1). The accuracy and clarity of AI-generated content are paramount, as misinformation can lead to poor clinical outcomes and decreased trust in healthcare providers. This study aims to extend the current understanding of the efficacy of AI-based Large Language Models (LLMs) in patient education by evaluating the accuracy and clarity of the information provided by three AI models on lumbar disc herniation.
Methods. First, the n=10 most Frequently Asked Questions (FAQs) on lumbar disc herniation were selected from a larger pool of 133 questions defined by both the authors and the AI chatbots. The 10 FAQs covered postoperative care, clinical manifestations, symptoms, surgical outcomes, and prognosis, and were submitted to the following publicly accessible LLMs: 1. ChatGPT 3.5; 2. ChatGPT 3.5 with a specific prompt; 3. Google Bard. The LLMs' responses were evaluated by n=6 experienced, independent spine surgeons through the online Google Forms application, using a rating system ranging from "excellent" to "unsatisfactory." Raters also scored each LLM's exhaustiveness, clarity, empathy, and answer length on a 5-point Likert scale. Inter-rater reliability was assessed using Fleiss' kappa. Chi-square tests (χ2) were applied to test differences in the frequency of answer distribution among LLMs, raters, and FAQs. A Friedman test was applied to test differences among LLMs in exhaustiveness, clarity, empathy, and length. Statistical analysis was conducted using GraphPad Prism 9.5.1, with significance set at p<0.05.
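The analysis was performed in GraphPad Prism 9.5.1; the sketch below is only an illustrative Python equivalent (not the authors' pipeline) showing how Fleiss' kappa, the chi-square test on rating frequencies, and the Friedman test on Likert scores could be computed with scipy and statsmodels. All rating values in the example are placeholders, not study data.

```python
# Illustrative sketch of the statistical tests described above.
# Placeholder data only; the study used GraphPad Prism 9.5.1.
import numpy as np
from scipy.stats import chi2_contingency, friedmanchisquare
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# Hypothetical categorical ratings for one LLM: 10 FAQs x 6 raters,
# categories 0-3 (0 = unsatisfactory ... 3 = excellent).
ratings = rng.integers(0, 4, size=(10, 6))

# Fleiss' kappa: convert the (subject x rater) matrix into a
# (subject x category) count table, then compute agreement.
counts, _ = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(counts, method="fleiss"))

# Chi-square test on the frequency of rating categories across LLMs:
# rows = LLMs, columns = counts per category (placeholder values).
freq_table = np.array([
    [15, 26, 11, 8],   # ChatGPT 3.5
    [17, 25, 10, 8],   # prompted ChatGPT 3.5
    [16, 28, 12, 4],   # Google Bard
])
chi2, p, dof, _ = chi2_contingency(freq_table)
print(f"Chi-square: chi2={chi2:.2f}, dof={dof}, p={p:.3f}")

# Friedman test on paired Likert scores (e.g., clarity) across the three
# LLMs, one score per FAQ per model (placeholder values).
clarity_gpt = rng.integers(3, 6, size=10)
clarity_gpt_prompted = rng.integers(3, 6, size=10)
clarity_bard = rng.integers(3, 6, size=10)
stat, p = friedmanchisquare(clarity_gpt, clarity_gpt_prompted, clarity_bard)
print(f"Friedman: stat={stat:.2f}, p={p:.3f}")
```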
Results. Exhaustiveness, clarity, empathy, and length of the answers were rated >3.5 on average, with no significant differences among LLMs. Likewise, no differences among LLMs were detected in the overall frequency of answer distribution (Figure 1A). Answers rated "excellent" ranged from 25% to 28%, "requiring minimal clarification" from 41% to 47%, "requiring moderate clarification" from 17% to 20%, and "unsatisfactory" from 8% to 13%. However, inter-rater reliability was unsatisfactory, and large differences among raters were detected in the frequency of answer distribution for ChatGPT 3.5 (p=0.023), prompted ChatGPT 3.5 (p<0.0001), and Google Bard (p=0.007), both separately and combined (p<0.0001; Figure 1B). Overall, ratings varied among the 10 answers (p=0.043), with Q2 (description of surgical techniques) receiving the worst scores and Q4 (decision-making for surgical treatment) the best (Figure 2).
Discussion. The LLMs provided answers with a generally good level of comprehensibility, with no differences among ChatGPT 3.5, prompted ChatGPT 3.5, and Google Bard. However, in some cases, the raters highlighted unclear or missing scientific evidence in the LLMs' statements. Raters' expectations were met (4.1/5), and the surgeons also reported a generally positive attitude toward the use of LLMs in healthcare education (4.8/5). The large inter-rater variability among the spine surgeons' ratings warrants further investigation.