But large language models are ‘diamonds in the rough’ with potential to empower clinicians, say experts.
The reliability of ChatGPT's answers to health questions drops as low as 28% when the model is supplied with supporting evidence, but remains as high as 80% when it is simply asked for a yes or no answer, Australian research finds.
“If you compare the current consensus in Australia versus internationally, there’s probably more pessimism locally around the use of these AI technologies,” CSIRO principal research scientist and associate professor at the University of Queensland Bevan Koopman told Health Services Daily.
“These large language models, which are the underlying technology behind ChatGPT, are diamonds in the rough currently.
“They show amazing potential in terms of how they might help people access information or make health decisions based on the latest evidence.
“But we’re still at the stage where we’re trying to understand how these models work, how they should be deployed, and how they behave under different scenarios.”
The study, published in the ACL Anthology, examined how closely ChatGPT's responses to health-related questions agreed with the correct, medically established answer, and how that agreement changed depending on how the questions were asked.
The researchers found that ChatGPT was 80% accurate when asked for a yes or no answer to a health question such as, “Can folic acid help improve cognition and treat dementia?”.
However, when the model was also given evidence, either supporting or contradicting the correct answer, accuracy fell, even when that evidence supported the correct answer.
“We’re not sure why this happens,” said Dr Koopman, who co-authored the study.
“But given this occurs whether the evidence given is correct or not, perhaps the evidence adds too much noise, thus lowering accuracy.”
Accuracy fell to a measly 28% when an “unsure” answer was allowed.
Dr Koopman said that while the data showed the accuracy of answers could be significantly degraded depending on how a question was asked, “overall, accuracy of 80% is really quite high for a dataset that has some pretty tricky questions”.
Dr Koopman said he hoped the research would motivate the development of healthcare-specific LLMs that were trained on the latest medical literature and could provide source attribution.
Chair of the RACGP expert committee on practice technology and management Dr Rob Hosking said that while AI was already proving useful to clinicians, particularly as an administrative or transcription tool for GPs, LLMs still seemed too immature to reliably answer health questions.
“It’s potentially very positive, but we’ve got to proceed with caution at the moment,” he said.
“As the CSIRO research has pointed out, currently, it’s making mistakes. And they’re pretty serious mistakes by the sound of it.”
But if the technology could amalgamate information from the many guidelines and databases doctors currently refer to, it could save valuable time, Dr Hosking added.
“It’s got the potential, if we can get it right, to empower GPs to manage patients that they might otherwise have had to refer to specialists.”
Dr Koopman “100%” agreed that health information provided through LLMs would need to be interpreted by a health professional.
“A lot of the work that my team does at CSIRO is with doctors at Queensland Health and across the country, trying to empower their evidence-based decision making by providing them with access to the latest medical evidence.
“That’s really where LLMs can play a big role.
“There’s an ever-growing amount of medical knowledge that any one clinician has to learn, retain and retrieve at critical points in time. If we can ease that burden and make it easy to access the latest evidence and understand how it relates to their particular patient, that’s really what our aim is in applying this generative AI technology in healthcare settings.”