An international analysis concludes that AI expands medical information, but with variable reliability

An international study with Spanish participation concludes that AI helps to expand medical information, but that its reliability is uneven and its answers must be checked with a professional.

An international study involving Rey Juan Carlos University (URJC) in Madrid and Henares University Hospital in Coslada has found that using artificial intelligence (AI) to expand medical information "is useful", although "its reliability is variable and must always be verified with a healthcare professional".

According to URJC, the research, which also involved experts from King's College and Solent University of London and was published in the specialized journal 'Artificial Intelligence in Medicine', "has evaluated how 'ChatGPT' or 'Gemini' respond to citizens' questions on topics such as the epidural".

The results show that "to answer these types of questions, the model with the best overall performance would be 'ChatGPT', followed by 'Gemini'", URJC said, while noting that "the quality of these models depends on the metric evaluated". "Although 'ChatGPT' shows the best data, two medium-sized models, 'OpenChat' and 'Phi-3', achieve comparable results, significantly outperforming other large models," stated the principal investigator, Marina del Barrio.

In Del Barrio's opinion, this "highlights the importance of the data with which they are trained versus the size of the model". The university adds that the work "has also focused on distinguishing between responses that can be trusted and those that can confuse patients and potentially alter their decision-making".

"The difficulty of the questions also affects the quality of the answers, with the most complex or controversial ones obtaining worse results," added the researcher, noting that "this makes the models less reliable when answering sensitive questions". To gather the information, "the scientific team established 10 questions to pose to the different language models, each reformulated in different ways".

Methodology and AI models evaluated

"To do this, we relied on both literature and clinical practice, and all questions were reformulated in both Spanish and English," explained Del Barrio, specifying that "the objective of this was to test the ability of these models to understand and respond to different formulations." All of this, she remarked, "always with simple statements, like those a patient might write at home without prior knowledge of AI."

Subsequently, according to URJC, the models to be analyzed were chosen, including 'ChatGPT', 'Gemini', 'OpenChat', and the 'Phi-2' and 'Phi-3' versions, "and specialized medical models such as 'MedLlama' and 'Meditron'." "The evaluation metrics focused on tangibility, reliability, sensitivity, safety, empathy, comprehensibility, and agreement with the expert," stated the institution, adding that "the more than 2,400 responses were manually reviewed and agreed upon by two experts to establish which were acceptable."

"The findings of this study open the door to developing more efficient and useful AI systems in medicine, which can support professionals and patients, always under medical supervision," concluded URJC, also highlighting that "the results question the idea that larger models are always better and suggest that training and data matter more than size."
