ChatGPT’s accuracy matters in healthcare, Stanford’s Nigam Shah says

Nigam Shah, chief data scientist at Stanford Health Care and co-founder of the physician consultation service platform Atropos Health, discusses necessary improvements for generative artificial intelligence models like GPT-4 and potential clinical applications for the technology.

Related: Unpacking ChatGPT’s early uses in healthcare

Many industry observers have been discussing the potential use of ChatGPT in the healthcare setting. What’s the difference between ChatGPT and artificial intelligence models like GPT-3.5 and GPT-4?

GPT-3.5 and GPT-4 are large language models, or foundation models. ChatGPT is a web application that uses GPT-4 as its internal “engine” to power the conversation between you and the bot.
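For illustration only, a minimal Python sketch of that application-versus-model distinction, assuming access to the OpenAI API: a toy chat loop that keeps conversation history and repeatedly calls the underlying model, much as a chat front end wraps an engine. The model name, prompts, and loop structure are placeholders, not a description of how ChatGPT is actually built.

```python
# Toy sketch: a chat "application" is a loop that keeps conversation state
# and repeatedly calls the underlying model API. Model name and prompts are
# illustrative assumptions, not the production ChatGPT setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_text = input("You: ")
    if user_text.strip().lower() in {"quit", "exit"}:
        break
    history.append({"role": "user", "content": user_text})
    reply = client.chat.completions.create(model="gpt-4", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print("Bot:", answer)
```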

You worked on a study, whose preliminary results appeared in the arXiv repository, that compared GPT-3.5 and GPT-4 models’ answers to clinical questions with “known truth” responses provided by a consultation service. You found that less than 20% of the large language model responses agreed with the “known truth” answers. What diagnostic uses of the models were you hoping to investigate?

There are tons of uses for these models in healthcare. Diagnosis is one of the higher-bar ones, in terms of needing accuracy. When a physician is at the bedside and they have an information need, they would typically consult their colleagues or they might consult UpToDate or PubMed. We wanted to assess if a large language model could help, and whether the response is safe and agrees with what we believe to be true.

Related: Epic, Microsoft bring GPT-4 to EHRs

What are the possible safety risks of using generative AI models in clinical care?

Safety is a very tricky concept. [In our study,] we had 66 questions, and 12 physicians reviewed the models’ responses to provide an opinion on whether they had any chance of doing patient harm. None of the answers was judged harmful by a majority, meaning more than six of the 12 physicians. But in about 5% of the roughly 3,000 individual assessments, a doctor said, “Yeah, maybe this [answer] looks a bit fishy. This could be harmful.” Usually, the nature of the harm is some sort of fabricated paper or a citation that doesn’t exist. It’s a generative model, so it makes stuff up, which is often called a hallucination. Safety-wise, I think both models came out pretty darn good.
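As an illustration of the majority-vote check described here, a minimal sketch assuming each response carries 12 physician harm votes; the function name, example data, and threshold handling are hypothetical, not the study's code or results.

```python
# Minimal sketch of a majority-vote harm flag: a response is flagged only if
# more than half the 12-physician panel marked it as potentially harmful.
# The example votes below are made up for illustration, not study data.
from typing import Dict, List

def flag_majority_harm(votes: Dict[str, List[bool]], panel_size: int = 12) -> Dict[str, bool]:
    """Return True per response only when harm votes exceed half the panel."""
    return {
        response_id: sum(harm_votes) > panel_size // 2
        for response_id, harm_votes in votes.items()
    }

# Example: one physician out of twelve flags response "q07" as possibly harmful.
example_votes = {
    "q07": [True] + [False] * 11,
    "q12": [False] * 12,
}
print(flag_majority_harm(example_votes))  # {'q07': False, 'q12': False}
```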

How might these hallucinations lead to possible patient harm?

There are varying degrees. In some cases, it would produce a citation that does not exist. It gives you the illusion of having factual data, when you have none. The broader category [for concern] is increasing [physician] confusion. These questions [for the model] arise when we don’t know what the right answer is. Then you have someone—or, in this case, a bot—coming in and providing an opinion. And now you have more confusion. Some could view the fact-checking burden that has now been created as a potential harm.

What areas do the GPT models need to improve upon in order to reduce disagreements between their responses and “known truths”?

A good place [for the model interface] to start would be to provide citations or backup for whatever summary is being provided: something whose veracity can be checked. Another way to handle it is to tone down the generative ability. There are some parameters that you can fiddle with. When we want factual accuracy, we might want it to be less creative.
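As an illustration of the kind of parameter he likely means, here is a minimal sketch using the OpenAI API's temperature setting, which trades creativity for more deterministic output; the model name, temperature value, and prompt are assumptions, not the study's configuration.

```python
# Illustrative only: lowering sampling temperature to make the model less
# "creative" for fact-oriented questions. Model, value, and prompt are
# placeholder assumptions.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0.1,  # near-deterministic; higher values sample more freely
    messages=[
        {"role": "user", "content": "Provide citations that support this summary."},
    ],
)
print(response.choices[0].message.content)
```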

What ethical considerations should providers and health systems be aware of when using generative AI models?

The ethics around this are still being worked out. If you go to an urgent care, and the doctor ends up searching Google, is that ethical? These days, most people would say, “Yeah, might as well look up what’s the latest [information] instead of just guessing.”

At the minimum, we should disclose when a generative AI [tool] is being used, so that when somebody sees a fishy response, they have an alert saying, “Maybe I should think about this, independently fact-check it, and avoid this sort of unquestioning trust in the output.”

Where would you like to see the healthcare industry go with its use of generative AI models, while still ensuring efficacy and patient safety?

We need a staggered approach. There’s a category of use cases that are relatively no-brainers: What time is the clinic open? Is this covered in my insurance? As long as we can ensure factual accuracy and it’s not hallucinating, why should a human have to answer that?

We might go a little bit into answering patient messages, then we can go into medical scribing, and we can keep going up the value chain. Then we might reach diagnosis, and the ability to provide a second opinion or a summary of the literature.

But we have to start experimenting, because without trying these things out, it’s hard to form a proper opinion. And we can’t really do five-year [randomized controlled trials] on these things, because they change every two weeks.

This interview has been edited for length and clarity.
