Concerns raised over accuracy of Whisper transcription tool in healthcare

Researchers found OpenAI’s Whisper AI tool sometimes generates fabricated sentences in medical transcriptions.


An AI transcription tool called Whisper, developed by OpenAI and used by thousands of clinicians and health systems, has come under scrutiny after researchers found it sometimes produces inaccurate transcriptions. Whisper, which powers the medical transcription tool from the company Nabla, has reportedly transcribed around 7 million medical conversations. While it accurately transcribes many doctor-patient exchanges, researchers from Cornell University and the University of Washington discovered instances where the AI generated entirely fabricated sentences, sometimes adding irrelevant or nonsensical phrases.

The study, presented at the Association for Computing Machinery FAccT conference in Brazil in June, found that Whisper made errors in about 1 percent of transcriptions, often producing 'hallucinations': fabricated statements generated in response to silences during conversations. These inaccuracies were especially common in audio samples featuring patients with aphasia, a language disorder that results in frequent pauses. In one case, Whisper inserted phrases more typical of a YouTube video, such as "Thank you for watching!"

Nabla, aware of the issue, has stated that it is working on ways to mitigate these hallucinations. OpenAI, in turn, emphasised its commitment to reducing such errors, particularly in high-stakes settings like healthcare. An OpenAI spokesperson noted that Whisper's usage policies discourage its application in critical decision-making contexts and that its guidance for open-source use advises against deployment in high-risk domains.

The study’s findings underscore the complexities of applying AI tools in sensitive settings like healthcare, where precise communication is vital. With Whisper being used across 40 healthcare systems, the issue raises broader questions around the suitability of AI transcription tools in medical environments and the ongoing need for oversight in their deployment.