Recent advancements in artificial intelligence, particularly the emergence of large language models (LLMs), have ignited considerable optimism regarding their potential to revolutionize healthcare access. For sectors like medical tourism and health tourism, where the provision of timely and accurate information to international patients is paramount, these technological innovations appear to offer a transformative pathway to democratize medical knowledge and bring care closer to individuals, regardless of their geographical location. Indeed, the proliferation of sophisticated LLMs, exemplified by platforms such as OpenAI’s ChatGPT, could theoretically empower individuals to undertake initial health assessments, receive tailored medical guidance, and even manage chronic conditions without needing immediate professional clinical intervention. Anecdotal accounts of patients successfully utilizing LLMs to self-diagnose conditions, often after traditional avenues proved inconclusive, are becoming increasingly prevalent, and surveys indicate a growing trend of individuals consulting AI chatbots for sensitive health inquiries, with a notable proportion of American adults engaging with them monthly.

The Discrepancy Between AI Benchmarks and Real-World Application in Global Healthcare

While LLMs now demonstrate remarkable proficiency in various medical tasks, achieving passing-level scores on the United States Medical Licensing Examination and producing clinical documents rated as equivalent, or even superior, to those drafted by human physicians, their integration into actual clinical settings has encountered significant hurdles. This disconnect between theoretical capability and practical performance is a critical consideration for any healthcare destination aiming to leverage AI for international patient care. Editorial opinion suggests that this challenge highlights a fundamental truth: excelling at controlled, in silico medical tasks does not automatically translate into accurate or effective performance in dynamic clinical environments, even under the direct guidance of medical professionals. For instance, research has shown that radiologists assisted by AI performed no better at interpreting chest X-rays than those working unassisted, and that both groups performed worse than the AI operating autonomously. Similarly, another investigation found only marginal improvements in diagnostic accuracy for physicians supported by LLMs over their unassisted counterparts, with both groups again lagging behind the LLMs working independently. These findings underscore that merely equipping healthcare practitioners with highly capable AI systems is often insufficient for complex tasks: professionals frequently struggle to evaluate and integrate AI-generated recommendations appropriately, limiting the potential benefits of AI assistance for quality of care.

LLMs as the “New Front Door” for Cross-Border Healthcare: A Strategic Imperative?

Despite these challenges, the concept of LLM-powered chatbots serving as a ‘new front door’ to healthcare, particularly for patients who lack extensive medical expertise or face barriers to traditional access, has gained considerable traction. This model holds significant strategic appeal for the medical tourism industry: it offers a potential means to broaden access to specialized medical knowledge and to alleviate pressure on overburdened health systems, thereby facilitating patient travel and initial inquiries for cross-border healthcare. The prospect of LLMs directly advising patients has, however, elicited a range of opinions from medical experts, who cite concerns about oversight, liability, and ethical implications while acknowledging the potential advantages of extending support beyond conventional clinical settings. In response to this perceived opportunity, private enterprises are investing substantially in language models tailored specifically for healthcare applications, aiming to enhance international patient care.

Unpacking Human-LLM Interaction Failures: A UK Study’s Revelations

To investigate whether LLMs can reliably support the general public and genuinely enhance patient access to care, a comprehensive study involving 1,298 participants in the UK was undertaken. Each participant was tasked with identifying potential health conditions and recommending an appropriate course of action (disposition) for one of ten distinct medical scenarios. These scenarios were crafted by a panel of three physicians who reached unanimous agreement on the correct dispositions, and a separate group of four physicians then provided differential diagnoses. Participants were randomly assigned to one of four experimental groups, with demographic stratification ensuring national representativeness. Three treatment groups received assistance from an LLM (GPT-4o, Llama 3, or Command R+), a deliberately diverse set of models capable of supplying medical information. The control group, conversely, was instructed to use any methods they would typically employ at home, such as internet searches, which are common practice among international patients exploring healthcare destination options.
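
As a concrete illustration of the assignment step, the sketch below shows one common way to randomize participants across four arms while balancing demographic strata. It is a minimal Python sketch only: the stratification variables (age_band, sex), the helper name stratified_assign, and the round-robin dealing are illustrative assumptions, not the study's published procedure.

```python
import itertools
import random
from collections import defaultdict

ARMS = ["GPT-4o", "Llama 3", "Command R+", "control"]

def stratified_assign(participants, seed=0):
    """Shuffle within each demographic stratum, then deal arms round-robin
    so every stratum is spread evenly across the four groups."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for person in participants:
        by_stratum[(person["age_band"], person["sex"])].append(person)
    for stratum in by_stratum.values():
        rng.shuffle(stratum)
        for person, arm in zip(stratum, itertools.cycle(ARMS)):
            person["arm"] = arm
    return participants

# Illustrative usage: 1,298 participants with made-up demographics.
cohort = [{"id": i, "age_band": i % 5, "sex": i % 2} for i in range(1298)]
stratified_assign(cohort)
```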

Disparity in Performance: LLMs Alone vs. LLMs with Users

The study’s initial findings revealed a stark contrast: despite the selected LLMs demonstrating high proficiency in accurately identifying dispositions and conditions when tested in isolation, participants struggled significantly to leverage these tools effectively. Editorial opinion strongly suggests this points to a critical flaw not in the AI’s knowledge base, but in the interface and interaction design for end-users, especially for those considering wellness tourism or medical tourism where self-assessment might be a first step.

  • Standalone LLM Efficacy: When scenarios and questions were directly posed to the models, they were able to suggest at least one relevant condition in an impressive 94.7% of cases for GPT-4o, 99.2% for Llama 3, and 90.8% for Command R+. Their accuracy in recommending dispositions ranged from 48.8% to 64.7%. These figures confirm the models’ inherent capacity to provide valuable medical information, significantly outperforming random guessing and aligning with their strong performance on other medical benchmarks.

  • Human-LLM Interaction Shortfalls: However, participants interacting with LLMs were significantly less likely than the control group to correctly identify at least one relevant medical condition (p < 0.001 for all three models), and on average they identified fewer relevant conditions. The control group exhibited 1.76 times higher odds of identifying a relevant condition than LLM users in aggregate, and was 1.57 times more likely to identify serious ‘red flag’ conditions. Furthermore, participants utilizing LLMs showed no statistically significant difference in disposition accuracy compared to the control group; while the overall correct response rate of 43.0% surpassed a random-guessing baseline, the majority of participants still selected an incorrect disposition, and both LLM users and the control group tended to underestimate the severity of their conditions. Crucially, participants consistently performed worse than when the LLMs were given the scenario and task directly, highlighting that strong standalone AI performance does not guarantee strong performance once a human user seeking global healthcare guidance is in the loop.
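
To make the headline odds figure concrete, an odds ratio is a simple calculation over two groups' success counts. The Python sketch below uses invented counts (not the study's raw data), chosen so the result lands near the reported 1.76:

```python
def odds_ratio(success_a, total_a, success_b, total_b):
    """Odds of success in group A divided by the odds in group B."""
    odds_a = success_a / (total_a - success_a)
    odds_b = success_b / (total_b - success_b)
    return odds_a / odds_b

# Hypothetical counts: 220 of 320 control participants versus 540 of 978
# LLM-assisted participants naming at least one relevant condition.
# (220/100) / (540/438) comes to roughly 1.78, the same kind of figure
# as the study's reported 1.76.
print(round(odds_ratio(220, 320, 540, 978), 2))  # -> 1.78
```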

The Breakdown in User-LLM Dialogue

To pinpoint the factors contributing to these performance issues, researchers meticulously analyzed the transcripts of participant interactions with the LLMs. This analysis unveiled critical communication breakdowns:

  • Incomplete User Input: Users frequently failed to provide the models with sufficient initial information to arrive at an accurate recommendation. In a significant number of sampled interactions, initial messages contained only partial details, requiring subsequent user input to complete the picture.

  • LLM Misinterpretation and Errors: Even when LLMs suggested relevant conditions (in 65.7% to 73.2% of cases, still lower than their standalone performance), users did not consistently incorporate these suggestions into their final responses, a secondary communication failure (the sketch following this list illustrates one way to measure such uptake). Furthermore, LLMs generated various forms of misleading or incorrect information. In some instances, initially correct responses were undermined by new, incorrect information after users provided additional details. In other cases, LLMs fixated on non-central terms within a user’s query, producing irrelevant advice. Contextual errors were also observed, such as recommending calling a partial US phone number and, in the same interaction, suggesting ‘Triple Zero’, the Australian emergency number. Alarmingly, LLMs also responded inconsistently to semantically similar inputs: two users describing symptoms of a subarachnoid hemorrhage received diametrically opposed, and potentially dangerous, advice.

  • User Interaction Strategies: Participants employed diverse strategies when engaging with LLMs. Some predominantly asked closed-ended questions, limiting the breadth of LLM responses. Others, in justifying their choices, appeared to anthropomorphize the LLMs, attributing human-like confidence to the AI. Conversely, some users deliberately withheld information to test the veracity of the model’s suggestions. Editorial opinion emphasizes that these varied interaction patterns underscore the complexity of designing intuitive and reliable AI tools for the general public, especially for delicate matters of health.
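
One way to quantify the uptake failure flagged above, where relevant suggestions surface in the chat but never reach the participant's final answer, is a simple transcript check. The following is a minimal Python sketch under assumed data shapes: the field names (llm_suggestions, final_answer, reference) and the lowercase string matching are illustrative simplifications, not the study's actual coding methodology.

```python
def suggestion_uptake(transcripts):
    """Count interactions where the LLM named a relevant condition, and how
    often that condition survived into the participant's final answer.

    Each transcript is assumed to be a dict with:
      llm_suggestions - conditions the model named during the conversation
      final_answer    - conditions the participant ultimately submitted
      reference       - the physicians' differential for that scenario
    """
    suggested = carried = 0
    for t in transcripts:
        relevant = {c.lower() for c in t["reference"]}
        named = {c.lower() for c in t["llm_suggestions"]} & relevant
        if named:
            suggested += 1
            if named & {c.lower() for c in t["final_answer"]}:
                carried += 1
    return suggested, carried

# Illustrative call: uptake below 100% reproduces the gap described above.
demo = [{"llm_suggestions": ["gallstones", "indigestion"],
         "final_answer": ["indigestion"],
         "reference": ["gallstones", "biliary colic"]}]
print(suggestion_uptake(demo))  # -> (1, 0): suggested, but not carried over
```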

The Inadequacy of Current Benchmarks for Interactive AI

The study rigorously demonstrated that conventional benchmarks, often relied upon to ensure the safety and reliability of LLMs before deployment, are insufficient predictors of human-LLM interaction failures. This has profound implications for the responsible rollout of AI in global healthcare and patient travel decision-making.

  • Medical Knowledge Benchmarks: Medical knowledge is typically assessed using questions from licensing examinations, such as the MedQA benchmark. When LLMs were scored on a targeted subset of MedQA questions relevant to the study’s scenarios, they consistently achieved higher accuracy than human participants using the same LLMs in the main study. While models typically met the approximate passing score of 60%, benchmark scores exceeding 80% still corresponded to human experimental scores below 20% in several instances. This striking divergence indicates that success in structured question-answering tasks does not guarantee effective application of that information in real-world, interactive scenarios.

  • Simulated Patient Interactions: Even simulations designed to mimic user interactions with LLMs failed to accurately predict human-LLM interaction failures. In a variant of the human study where LLMs replaced human participants, these simulated users generally performed better than their human counterparts, with less variation in results. However, the distribution of these results did not mirror human variability, and the correlation between simulated and real participant scores was weak or non-existent. This crucial finding reinforces that simulated participants do not accurately reflect the complexities of human-LLM interactions, making direct human involvement in safety testing indispensable for international patient care applications.
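
For teams weighing simulation against human safety testing, the weak correspondence reported above is straightforward to check. The sketch below computes a Pearson correlation over per-scenario scores; the numbers are invented purely to demonstrate the computation.

```python
from statistics import correlation

# Invented per-scenario accuracy figures, purely to show the computation;
# they are not taken from the study's data.
human_scores     = [0.35, 0.42, 0.18, 0.51, 0.29, 0.44, 0.38, 0.25, 0.47, 0.33]
simulated_scores = [0.60, 0.62, 0.58, 0.61, 0.59, 0.63, 0.57, 0.62, 0.58, 0.60]

r = correlation(human_scores, simulated_scores)
print(f"Pearson r = {r:.2f}")
# In the study, the corresponding correlation was weak or non-existent,
# which is why the authors argue simulation cannot replace human testing.
```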

The Path Forward: Human-Centric Design for Medical AI in a Global Context

These findings collectively underscore the significant challenges inherent in deploying LLMs for direct public medical assistance. Despite the high proficiency of LLMs operating autonomously, the combination of LLMs and human users proved no more effective than the control group at assessing clinical acuity and was notably worse at identifying relevant conditions. This extends to the general public earlier findings that LLMs do not enhance clinical reasoning among physicians, a critical insight for healthcare destination providers considering AI integration. The study identifies the transmission of information between the LLM and the user as a particular vulnerability, stemming both from users providing incomplete information and from LLMs failing to convey correct suggestions effectively.

Editorial opinion holds that addressing these issues requires a multi-faceted approach, particularly for the nuanced demands of medical tourism and cross-border healthcare:

  1. Enhancing Information Communication: LLMs typically offered an average of 2.21 possible options, leaving users to make the final decision—a task at which they performed poorly. Given that LLMs alone often outperform most users, significant improvements in how information is communicated from LLMs to users are imperative. This necessitates interactive, multi-turn evaluations to better understand and refine these capabilities.
  2. Improving User Information Elicitation: Much like a real doctor-patient consultation, users in this study chose what information to share with LLMs, leading to instances where models lacked sufficient data for accurate advice. Future patient-facing AI systems must develop the ability to actively elicit the information they need, for example by asking targeted follow-up questions, rather than relying on users to volunteer complete and accurate details unprompted.