Sepsis, a life-threatening reaction to an infection, is one of the leading causes of hospital deaths in the U.S., and is a key quality measure. Because of this, many hospitals have adopted software tools to predict the onset of sepsis, but there’s little published data to evaluate how accurate they actually are.
Researchers at the University of Michigan Medical School aimed to rectify that. They evaluated a sepsis prediction tool developed by Epic Systems that has been implemented at hundreds of hospitals. The results were concerning: the model's accuracy was "substantially worse" than Epic claimed, with an area under the curve of 0.63, meaning it correctly distinguished patients who developed sepsis from those who did not only 63% of the time, according to a paper published in JAMA Internal Medicine.
In addition, the model raised sepsis alerts on nearly a fifth of all hospitalized patients, potentially contributing to alert fatigue. Despite the volume of alerts, it identified sepsis in only 7% of patients whose diagnosis had been missed by a clinician.
“This means that the model was firing significantly often but only had minimal benefit above usual clinical practice,” Dr. Andrew Wong, the study’s first author, wrote in an email. “Alert fatigue is important because it can interfere with providers’ ability to deliver care, dilutes the importance of other alerts in the system, and contributes to physician burnout.”
Dr. Anand Habib, a physician at the University of California, San Francisco who wrote a commentary on the paper, said in a podcast interview that he had personally seen the effects of alert fatigue. For example, the sepsis tool generated alerts for a patient with rapidly progressive interstitial lung disease because he was short of breath and had a rapid heart rate.
“It seemed like 2 to 3 times a day, there would be warnings in our Epic health record system saying ‘is this patient septic?’” he said. “I think it was creating a lot of moral distress and concern on the part of the nursing staff, given that every time this sort of warning sign pops up in Epic, they’re wondering, ‘are we failing this patient in terms of quality of care?’”
The retrospective study was conducted using data from more than 27,000 patients hospitalized at Michigan Medicine from December 2018 to October 2019. It only included one medical center, but had a relatively diverse cohort of patients.
The Epic Sepsis Model calculates risk scores based on patients' vital signs, laboratory values and other information pulled from electronic health records. The score is recalculated every 15 minutes. Hospitals can set the score threshold that triggers an alert, with lower thresholds generating more alerts and higher thresholds potentially missing some patients. The University of Michigan researchers used a threshold of 6, at which patients would still be considered to have a high risk of sepsis.
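The threshold mechanism described above can be illustrated with a minimal sketch. This is a hypothetical example of threshold-based alerting, not Epic's actual code; the function name and the sample scores are made up, and only the 15-minute recalculation cadence and the threshold of 6 come from the article.

```python
# Hypothetical sketch of threshold-based alerting. Assumes a risk score is
# recomputed periodically (every 15 minutes, per the article) and an alert
# fires when the score meets or exceeds a hospital-chosen threshold.

ALERT_THRESHOLD = 6  # the threshold used in the Michigan study


def should_alert(risk_score: float, threshold: float = ALERT_THRESHOLD) -> bool:
    """Return True if this risk score would trigger a sepsis alert."""
    return risk_score >= threshold


# A lower threshold fires more alerts; a higher one misses more patients.
scores = [2.1, 4.8, 6.3, 9.0]
print([s for s in scores if should_alert(s)])             # threshold 6
print([s for s in scores if should_alert(s, threshold=4)])  # lower threshold
```

Running the same scores against two thresholds shows the trade-off the article describes: lowering the threshold from 6 to 4 adds an alert for the 4.8 score.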
Part of the discrepancy in accuracy may stem from the fact that Epic's model was trained using hospital billing codes to measure sepsis outcomes, Wong said. The model also defined the onset of sepsis as the time the clinician intervened.
“Historically, hospital billing codes have been notorious for being inaccurate or not reflective of the actual disease processes of hospitalized patients,” he wrote. “Because of the inaccurate nature of billing codes, the model is essentially trained to identify which patients will be billed for sepsis, not which patients actually develop the clinical criteria for sepsis.”
In response, Epic maintained that its model still helped clinicians provide early interventions for patients, and that its full mathematical formula and model inputs are available to administrators.
It also pointed to a preprint from researchers at University of Colorado Health, who found that Epic's model was more accurate than their existing early warning score system.
In an emailed statement, the company said that the study “did not take into account the analysis and required fine tuning that needs to occur prior to real-world deployment of the model.”
The study raises questions not only about sepsis prediction tools, but also about how healthcare algorithms more broadly are deployed at hospitals, often with little regulatory review or outside scrutiny.
Most decision-support software tools are classified as Class II medical devices, which means that manufacturers need only establish that they are "substantially equivalent" to an existing device to market them.
Models that aren't being marketed may not have been reviewed by the FDA at all. A survey conducted by MedCity News this year found that several hospitals used algorithms to predict Covid-19 diagnosis or progression, none of which had been cleared by the FDA.
As more proprietary models are developed and deployed, the onus is currently on hospitals to validate that these tools actually work for their patient population.
“We encourage hospitals to perform their own internal validation studies to see how effective the models are performing in their own systems to guide whether they should in fact be deploying the model,” Wong wrote. “Furthermore, it is essential that we externally validate these types of proprietary prediction models before adopting them out of ease or convenience.”
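The internal validation Wong recommends boils down to scoring a model against locally adjudicated outcomes. As a rough illustration, here is a sketch of computing the area under the ROC curve, the same metric the study reported, from made-up labels and scores; the data and function are hypothetical and not drawn from the study.

```python
# Hypothetical internal-validation sketch: compute ROC AUC for a vendor
# model's risk scores against locally adjudicated sepsis labels.
# All data below are fabricated for illustration.

def auc(labels, scores):
    """Probability that a randomly chosen positive case scores higher than a
    randomly chosen negative case (ties count half) -- the Mann-Whitney
    formulation of ROC AUC. 0.5 is chance; 1.0 is perfect ranking."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 0, 1, 1]              # 1 = sepsis confirmed on chart review
scores = [2.0, 5.5, 4.0, 6.5, 7.0, 8.0]  # model risk scores for each patient
print(round(auc(labels, scores), 3))
```

A hospital running this kind of check on its own patient population, with its own adjudicated labels, is exactly the internal validation step the authors argue should precede deployment.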
Photo credit: AnuStudio, Getty Images