The project began with a vexing problem. Imaging tests that turned up unexpected issues — such as suspicious lung nodules — were being overlooked by busy caregivers, and patients who needed prompt follow-up weren’t getting it.
After months of discussion, the leaders of Northwestern Medicine coalesced around a heady solution: Artificial intelligence could be used to identify these cases and quickly ping providers.
If only it were that easy.
It took three years to embed AI models to flag lung and adrenal nodules into clinical practice, requiring thousands of work hours by employees who spanned the organization, from radiologists and human resources specialists to nurses, primary care doctors, and IT experts. Developing accurate models was the least of their problems. The real challenge was building trust in the models’ conclusions and designing a system in which the tool’s warnings translated into effective, real-world care rather than prompting providers to click past a pop-up.
“There were so many surprises. This was a learning experience every day,” said Jane Domingo, a project manager in Northwestern’s office of clinical improvement. “It’s amazing to think of the sheer number of different people and expertise that we pulled together to make this work.”
Ultimately, the adrenal model failed to produce the necessary level of accuracy in live testing. But the lung model, which targets by far the most common source of suspicious lesions, proved highly adept at notifying caregivers, paving the way for thousands of follow-up tests for patients, according to a paper published last week in NEJM Catalyst. Additional study is needed to determine whether those tests are reducing the number of missed cancers.
STAT interviewed employees across Northwestern who were involved in building the algorithm, incorporating it into IT systems, and pairing it with protocols to ensure that patients received the rapid follow-up that had been recommended. The challenges they faced, and what it took to overcome them, underscore that AI’s success in medicine hinges as much on human effort and understanding as it does on the statistical accuracy of the algorithm itself.
Here’s a closer look at the players involved in the project and the obstacles they faced along the way.
To get the AI to flag the right information, it needed to be trained on labeled examples from the health system. Radiology reports had to be marked up to note incidental findings and recommendations for follow-up. But who had the time to mark up tens of thousands of clinical documents to help the AI spot the telltale language?
The human resources department had an idea: Nurses who had been put on light duty due to work injuries could be trained to scan the reports and pluck out key excerpts. That would eliminate the need to hire a high-priced third party with unknown expertise.
However, highlighting discrete passages in lengthy radiology reports is not as easy as it sounds, said Stacey Caron, who oversaw the team of nurses doing the annotation. “Radiologists write their reports differently, and some of them will be more specific in their recommendations, and others will be more vague,” she said. “We had to make sure the education on how [to mark relevant excerpts] was clear.”
Caron met with nurses individually to orient them to the project and created a training video and written instructions to guide their work. Each report had to be annotated by multiple nurses to ensure accurate labeling. In the end, the nurses logged about 8,000 work hours annotating more than 53,000 distinct reports, creating a high-quality data stream to help train the AI.
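One common way to combine labels from several annotators is a simple majority vote. The sketch below illustrates that general technique only. The article does not say how Northwestern reconciled disagreements, so the function name, the label strings, and the two-thirds threshold are all hypothetical.

```python
from collections import Counter
from typing import Optional


def consensus_label(labels: list[str], min_agreement: float = 2 / 3) -> Optional[str]:
    """Return the majority label, or None when annotators disagree too much.

    The 2/3 threshold is an illustrative assumption, not a figure from the project.
    """
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes / len(labels) >= min_agreement else None


# Three hypothetical nurses label the same report; two agree.
print(consensus_label(["follow_up", "follow_up", "no_follow_up"]))
# An even split falls below the threshold and would need a human tiebreaker.
print(consensus_label(["follow_up", "no_follow_up"]))
```

Reports that come back as None would be candidates for a second look rather than for the training set.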
The model builders
Developing the AI models may not have been the hardest task in the project, but it was crucial to its success. There are several different approaches to analyzing text with AI — a task known as natural language processing. Picking the wrong one means certain failure.
The team started with an approach known as regular expressions, or regex, which searches for manually defined word sequences within text, like “non-contrast chest CT.” But because of the variability in wording used by radiologists in their reports, the AI became too error-prone. It missed an unacceptable number of suspicious nodules in need of follow-up, and flagged too many reports where they didn’t exist.
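A minimal sketch shows why hand-written patterns struggle with varied wording. The pattern and the two sample reports below are invented for illustration and are not the team's actual rules.

```python
import re

# Hypothetical pattern: a follow-up cue followed somewhere later by "chest CT".
FOLLOW_UP_PATTERN = re.compile(
    r"\b(?:follow[- ]up|recommend(?:ed)?)\b.*?\b(?:non-contrast )?chest CT\b",
    re.IGNORECASE,
)


def flags_follow_up(report: str) -> bool:
    """Return True if the report matches the hand-written pattern."""
    return FOLLOW_UP_PATTERN.search(report) is not None


reports = [
    "6 mm nodule in the right upper lobe. Recommend non-contrast chest CT in 6 months.",
    "Small right upper lobe nodule; repeat imaging in six months is advised.",
]
for text in reports:
    print(flags_follow_up(text), "-", text)
```

The first report is caught. The second asks for the same kind of follow-up in different words and slips through, which is the sort of miss the paragraph above describes.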
Next, the AI specialists, led by Mozziyar Etemadi, a professor of biomedical engineering at Northwestern, tried a machine learning approach called bag-of-words, which counts how many times each word from a pre-selected vocabulary list appears in a report, creating a numeric representation that can be fed into the model. This, too, failed to achieve the desired level of accuracy.
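In code, a bag-of-words representation can be as simple as the sketch below. The vocabulary is hypothetical, since the article does not list the words the team selected, and the sample sentences are invented.

```python
import re
from collections import Counter

# Hypothetical vocabulary; the project's real word list is not given in the article.
VOCABULARY = ["nodule", "follow", "recommended", "ct", "stable", "benign"]


def bag_of_words(report: str, vocabulary: list[str] = VOCABULARY) -> list[int]:
    """Count how often each vocabulary word appears, ignoring word order."""
    tokens = re.findall(r"[a-z]+", report.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


with_follow_up = bag_of_words("Follow-up CT recommended for nodule.")
without_follow_up = bag_of_words("No follow-up CT recommended for nodule.")
print(with_follow_up)
print(without_follow_up)
```

Both sentences produce the same vector because "no" is not in the vocabulary and word order is discarded. That is a general limitation of the technique, offered here as an illustration rather than as the specific reason the team's model fell short.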
The shortcomings of those relatively simple models pointed to the need for a more complex architecture known as deep learning, where data are passed through multiple processing layers in which the model learns key features and relationships. This method allowed the AI to understand dependencies between words in the text.
Early testing showed the model almost never missed a report that flagged a suspicious nodule.
“It’s really a testament to these deep learning tools,” said Etemadi. “When you throw more and more data at it, it gets it. These tools really do learn the underlying structure of the English language.”
But technical proficiency, though an important milestone, was not enough for the AI to make a difference in the clinic. Its conclusions would only matter if people knew what to do with them.
“AI cannot show up and give the clinicians more work,” said Northwestern Medicine’s chief medical officer, James Adams, who championed the project in the health system’s executive ranks. “It needs to be an agent of the frontline people, and that’s different from how health care technology of this past generation has been implemented.”
The alert architects
A commonly used vehicle for delivering timely information to clinicians is known as a best practice alert, or BPA — a message that pops up in health records software.
Clinicians are already bombarded with such alerts, and adding to the list is a touchy subject. “We kind of have to have our ducks in a row, because if it’s interruptive, it’s going to face some resistance from physicians,” said Pat Creamer, a program manager for information services.
The solution in this case was to embed the alert in clinicians’ inboxes, where two red exclamation marks signify a message requiring immediate attention. To reinforce trust in the validity of the AI’s alert, the relevant text from the original report was embedded within the message, along with a hyperlink that allows physicians to easily order the recommended follow-up test.
Creamer said the message also allows clinicians to reject the recommendation if other information indicates follow-up is not needed, such as ongoing management of the patient by someone else. The message can also be transferred to that other caregiver.
The most important part of the alert, Creamer said, was building it into the record-keeping system so that the team could keep tabs on each part of the process. “It’s not a normal BPA,” he said, “because it’s got programming behind it that’s helping us track the findings and recommendations throughout the whole lifecycle.”
And in cases where patients didn’t receive follow-up, they were ready with plan B.
The loop closers
The alert system needed a backstop to ensure that patients didn’t fall through the cracks. That challenge fell into the lap of Domingo, the project manager, who had to figure out how to ensure patients would show up for their next test.
The first line of defense was a dedicated team of nurses tasked with following up with patients if the ordered test was not completed within a certain number of days. Given the difficulty of reaching patients by phone, however, they needed another option. The idea was floated of sending a letter to patients by mail, but some physicians worried that a notification of a suspicious lesion would induce panic, triggering a flood of nervous phone calls.
“The letter became one of my passions,” Domingo said. “It was something I really pushed for.”
The wording of the letter was especially tricky. She reached out to Northwestern’s patient advisory councils for input. “There was overwhelming feedback that we should alert them that there was a finding that may need follow-up,” she said. But a suggestion was made to add another clause noting that such findings are not always serious and may just require additional consultation. The letter is now sent to patients within seven days of the initial AI alert to physicians.
“From the limited number of complaints we’ve gotten,” Domingo said, “this was an important piece to help improve patient safety.”
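The two backstops described above, the patient letter and the nurse outreach, amount to simple date checks. The sketch below assumes a hypothetical 30-day grace period for nurse outreach, because the article says only "a certain number of days." The seven-day letter window comes from the article. The function and constant names are invented.

```python
from datetime import date, timedelta
from typing import Optional

LETTER_WINDOW_DAYS = 7    # from the article: letter goes out within seven days of the alert
OUTREACH_GRACE_DAYS = 30  # hypothetical; the real threshold is not stated


def letter_deadline(alert_date: date) -> date:
    """Latest date the patient letter should be sent."""
    return alert_date + timedelta(days=LETTER_WINDOW_DAYS)


def needs_nurse_outreach(
    order_date: date,
    completed_date: Optional[date],
    today: date,
    grace_days: int = OUTREACH_GRACE_DAYS,
) -> bool:
    """True when an ordered test is still incomplete after the grace period."""
    if completed_date is not None:
        return False
    return today > order_date + timedelta(days=grace_days)


print(letter_deadline(date(2024, 1, 1)))
print(needs_nurse_outreach(date(2024, 1, 1), None, date(2024, 2, 15)))
```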
Since the start of the project, the AI has prompted more than 5,000 physician interactions with patients, and more than 2,400 additional tests have been completed. It remains a work in progress, with additional tweaks to ensure the AI remains accurate and that the alerts are finely tuned. Some doctors remain skeptical, but others said they see a value in AI that wasn’t so clear when the project started.
“The bottom line is the burden is no longer on me to track everything,” said Cheryl Wilkes, an internal medicine physician. “It makes me sleep better at night. That’s the best way I can explain it.”