Artificial intelligence (AI) has made significant progress in visual analysis, particularly since the first image recognition models emerged nearly 15 years ago. These models are not perfect, however. In medicine, and especially in cancer detection, pattern recognition is crucial. Radiologists use X-rays and magnetic resonance imaging to make tumors visible, while pathologists examine samples of kidney, liver, and other tissue under a microscope to identify telltale patterns. These patterns help determine how severe a cancer is, how well a treatment might work, and whether the disease is likely to spread.
In theory, AI should be a great aid in this process. “Our task is pattern recognition,” says Andrew Norgan, a pathologist and medical director of the digital pathology platform at the Mayo Clinic. “We examine the slides and gather information that has proven to be important.”
There have been many attempts to develop an AI model for examining cancer tissue; at least seven new models were introduced last year alone, yet all remain experimental. What will it take for these models to be effective in real-world clinical settings?
New results from the AI health company Aignostics and the Mayo Clinic, published on the preprint server arXiv, offer some insight. Although the work has not yet been peer-reviewed, it reveals much about the challenges of bringing such a tool into clinical environments.
The model, named Atlas, was trained on 1.2 million tissue samples from 490,000 cancer cases. Its accuracy was then compared with that of six other leading AI pathology models. These models compete on benchmark tests such as classifying breast cancer images or grading tumors, with their predictions scored against the answers of human pathologists.
Atlas outperformed the competing models in six of nine tests. It scored highest at grading colon cancer tissue, matching human pathologists’ conclusions in 97.1% of cases. In classifying prostate cancer biopsies, however, Atlas reached only 70.5%, well below the top scores posted by other models. Averaged across the tests, the model matched human experts’ answers in 84.6% of cases.
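To make these scores concrete, here is a minimal sketch, in Python, of how percent agreement between a model’s predictions and pathologists’ labels could be computed for a single benchmark. The task, grade labels, and data below are invented for illustration; real benchmarks use far larger sets of annotated slides.

```python
# Minimal sketch: percent agreement between model predictions and
# pathologist labels on one benchmark task. All data is invented.

def percent_agreement(predictions: list[str], labels: list[str]) -> float:
    """Share of cases where the model matches the pathologist, in percent."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    matches = sum(p == l for p, l in zip(predictions, labels))
    return 100.0 * matches / len(labels)

# Hypothetical colon-cancer grading task with three grades.
pathologist_labels = ["grade_1", "grade_2", "grade_2", "grade_3", "grade_1"]
model_predictions  = ["grade_1", "grade_2", "grade_3", "grade_3", "grade_1"]

print(f"Agreement: {percent_agreement(model_predictions, pathologist_labels):.1f}%")
# Agreement: 80.0%
```

A headline figure like Atlas’s 84.6% is an average of per-benchmark scores of this kind across all nine tests.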
The gold standard for understanding what is happening to cancer cells in tissue remains a pathologist’s examination of the sample, and AI models are measured against it. The best models approach human performance on specific recognition tasks but lag behind on others. How good does a model have to be to be clinically useful?
According to Carlo Bifulco, Chief Medical Officer at Providence Genomics, even a model that agrees with pathologists 90% of the time may not be good enough. Still, imperfect models can be useful, for instance by speeding up diagnosis.
The primary issue is training data. Less than 10% of pathology practices in the US are digitized, which means tissue samples are analyzed under a microscope and then archived without ever being digitally recorded. European practices are further along, and efforts are underway to build shared datasets for training AI models, but data remains scarce.
Without diverse datasets, AI models struggle to recognize the wide range of anomalies that human pathologists learn to interpret. The problem is worst for rare diseases: a publicly accessible database may accumulate only 20 examples of a given rare disease over ten years.
The Mayo Clinic foresaw this data shortage and decided to digitize its entire pathology practice, including 12 million slides collected over decades. It hired a company to build a robot that takes high-resolution photographs of the tissues, processing up to a million samples per month. This effort yielded the 1.2 million high-quality samples used to train Atlas.
Another challenge is that digital tissue samples are enormous. Biopsy samples are tiny, but they are magnified to the point where a digital image can contain more than 14 billion pixels; at three bytes per pixel, that is roughly 40 gigabytes uncompressed. This demands significant storage and forces decisions about which parts of the image to use for training, at the risk of overlooking important cells. The Mayo Clinic used a method called tiling, dividing each slide image into many smaller patches that the AI model can process. Selecting those tiles is both an art and a science, and it is not yet clear which methods work best.
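As a rough illustration of what tiling involves, here is a minimal Python sketch using the openslide library, which is commonly used to read whole-slide images. The file name, tile size, and the brightness-based tissue filter are illustrative assumptions, not the pipeline Mayo or Aignostics actually used.

```python
# Minimal tiling sketch for a whole-slide image using openslide.
# The file name, tile size, and brightness threshold are assumptions
# for illustration; production pipelines are far more elaborate.
import openslide

TILE = 512            # tile edge length in pixels
WHITE_CUTOFF = 230    # mean brightness above this is treated as empty glass

slide = openslide.OpenSlide("sample_slide.svs")  # hypothetical slide file
width, height = slide.dimensions                 # full resolution, often gigapixels

tissue_tiles = []
for y in range(0, height - TILE + 1, TILE):
    for x in range(0, width - TILE + 1, TILE):
        # read_region returns an RGBA PIL image; convert to grayscale
        region = slide.read_region((x, y), 0, (TILE, TILE)).convert("L")
        # Background glass scans as nearly uniform white, so a high mean
        # brightness indicates a tile with no tissue worth keeping.
        mean_brightness = sum(region.getdata()) / (TILE * TILE)
        if mean_brightness < WHITE_CUTOFF:
            tissue_tiles.append((x, y))

print(f"Kept {len(tissue_tiles)} tissue tiles from a {width}x{height} slide")
```

Even this toy filter shows where the judgment calls creep in: the tile size, the threshold for “empty” tiles, and which magnification level to read all shape what the model ever gets to see.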
The third issue is deciding which benchmarks matter most for a cancer-detection AI model. The Atlas researchers tested their model on molecular benchmarks, which measure whether a model can predict molecular-level events from visual clues in tissue images. For example, mismatch-repair genes matter in cancer because they catch errors made when DNA is copied during cell division; if those errors go uncorrected, they can drive cancer development.
Atlas’s average score on the molecular tests currently stands at 44.9%. While that is the best AI performance to date, it shows there is still a long way to go. According to Bifulco, Atlas represents incremental progress, but more substantial advances will require different models and larger datasets.