Miracle or Mirage? A Recent Paper Hypes Machine Learning in Mammography
On typically inflated claims of accuracy improvement, persistent uncertainty, and what’s missing from the practice of informed consent
Using machine learning to decide which women to screen with mammography for early breast cancer detection “would reduce the long-run incidence of later-stage tumors by 40%” while also halving overdiagnosis, according to Daysal et al. (“An Economic Approach to Machine Learning in Health Policy,” N. Meltem Daysal, Sendhil Mullainathan, Ziad Obermeyer, Suproteem K. Sarkar, and Mircea Trandafir; January 31, 2024). Reduce false negatives while reducing false positives — what’s not to like?
The claimed finding would represent a breakthrough in advancing our use of what is perhaps the best-studied medical intervention ever to maximize its benefits and minimize its harms. Other recent cutting-edge research at the intersection of AI/ML and mammography has struggled to achieve this Holy Grail. Mass screenings for low-prevalence problems generally seem to be trapped in the accuracy-error trade-off that leaves us with what Stephen Fienberg, co-chairing the 2003 NAS polygraph report, famously called an unacceptable choice between too many false positives or too many false negatives.
There is no exit from this trade-off. It results from the universal laws of mathematics as applied to our imperfect world of probabilistic cues with which we attempt to solve signal detection problems. But maybe AI/ML can learn to solve them better.
Add to this structural challenge our usual inferential and interpretive problems: According to leading methodologists, we need to think about causality first in order to make inferences (aka do science). So, at best, black-box ML may present a challenge to science reform. One answer to that challenge would be to say that we need to be learning from the ML results in our causal modeling; maybe computers are better at distinguishing signal from noise in ways we can learn from in an iterative process.
The other usual inferential problem in play here is that we don’t really know what will happen in the future. That makes tallying true and false positives and negatives guesswork, at least until we know which diagnosed and treated cancers were clinically important and which were not, if we ever do (that may be possible on a comparative, country-level basis, but not an individual one).
Bracketing my natural suspicion of something that sounds too good to be true, it would be great if using health claims data to select who to screen could meaningfully reduce both late-stage tumors and overdiagnosis, letting us lessen the pain of that inescapable trade-off by enhancing our ability to distinguish signal from noise in this prototypical signal detection problem. On the other hand, there are three things going on in this paper that bother me.
First, something doesn’t quite add up in the paper’s central numeric claims. Second, we don’t need a fancy algorithm to get better results. We just need to give people (women) better information with which to make their own decisions, like we should be doing in the first place. And third, we still don’t know how these (mis)categorizations would shake out in the real world over time.
Now with 40% more!
At first glance, this looks like a typical example of researchers with perverse incentives inflating their tools’ claimed accuracy rate. It’s a pervasive problem in research, including in AI/ML. These claimed accuracy rates then often get misused to promote ever more, and ostensibly better, mass screenings for low-prevalence problems despite persistent uncertainties about these programs’ net effects on society.
Here’s the numeric problem with this paper’s claims:
For our main model, we predict whether the screen led to an invasive tumor diagnosis… For comparison, using just age to predict invasive tumor has an AUC-ROC of 0.583. A gradient boosted-tree model model [sic] that includes age, nulliparity, age of first birth, and family history increases the AUC-ROC to 0.585, and adding history of progestogens, estrogens and angiotensin receptor blockers— drugs often found to be correlated with breast cancer—increases the AUC-ROC to 0.589. Compared to a random classifier, which would have an AUC-ROC or [sic] 0.5, our model increases predictive performance by (0.629 – 0.5)/(0.583 – 0.5) ≈ 55% over using just age to predict invasive tumor incidence. Our model also increases predictive performance by (0.629 – 0.5)/(0.589 – 0.5) ≈ 45% over using a model trained on established risk factors (p. 12).
(Terminological clarification: AUC-ROC means area under the receiver operating characteristic curve. It’s a performance metric for binary classifiers that summarizes the ROC curve in a single number, running from 0.5 (random guessing) to 1.0 (perfect classification). It does not tell us whether a model prioritizes minimizing false negatives or false positives, and so is not generally considered a good proxy for accuracy, even though it’s often referred to as such.)
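For readers who want to see the metric in concrete terms, here is a minimal sketch using scikit-learn’s roc_auc_score, with toy labels and risk scores invented purely for illustration:

```python
# Toy illustration of AUC-ROC; the labels and scores below are made up for this example.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 0, 1, 1, 0, 1]                         # 1 = invasive tumor diagnosed, 0 = not
risk_scores = [0.2, 0.4, 0.35, 0.1, 0.8, 0.65, 0.3, 0.7]  # some model's predicted risk

# AUC is the probability that a randomly chosen positive case gets a higher score than a
# randomly chosen negative case. It says nothing about where the screening threshold is set,
# and so nothing about the resulting mix of false positives and false negatives.
print(roc_auc_score(y_true, risk_scores))  # ~0.94 on this toy data; 0.5 would be chance level
```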
What’s going on here? The central claim has jumped from 40% to 45% improvement, but that’s not referring to absolute or relative improvement in AUC as a proxy for accuracy. The underlying AUC improvement is from 0.589 to 0.629 — an increase of 0.04, or about a 6.8% relative jump. That sounds much less impressive than the claimed 40-45% improvement.
Here’s the trick: the authors define “predictive performance” as the model’s distance from random guessing, i.e., AUC – 0.5. But no one is flipping a coin in real life to decide whom to screen, which makes this a questionable metric.
Then, the authors report the relative improvement in that metric. This is a form of rescaling that is nonstandard, inflates perceived gains, and especially distorts the relative improvement when the baseline is small to begin with (e.g., AUCs below .7). This tactic gives the impression of a dramatic boost in predictive quality, when in fact the gain is modest and should be reported as such.
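To make that concrete, here is a minimal sketch (plain Python; the function names are mine) that reproduces both numbers from the AUCs the paper reports:

```python
# Reproducing the headline ratio from the paper's reported AUCs.
def plain_relative_gain(auc_new, auc_old):
    """Relative AUC improvement as most readers would understand it."""
    return (auc_new - auc_old) / auc_old

def rescaled_gain(auc_new, auc_old, chance=0.5):
    """The paper's 'predictive performance' ratio: distance above chance, new vs. old."""
    return (auc_new - chance) / (auc_old - chance) - 1

auc_risk_factors = 0.589  # age plus established risk factors (as reported)
auc_ml = 0.629            # the gradient-boosted ML model (as reported)

print(f"{plain_relative_gain(auc_ml, auc_risk_factors):.1%}")  # ~6.8%
print(f"{rescaled_gain(auc_ml, auc_risk_factors):.1%}")        # ~44.9%, i.e., the headline ~45%
```

Same 0.04 gain either way; only the denominator changes.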
Researchers and policy audiences alike should be cautious when encountering nonstandard performance metrics that are not clearly contextualized. If predictive improvements require special rescaling to sound meaningful, they might not be as actionable as they appear.
Better models don’t require machine learning, just better-informed consent
So where does the first-mentioned improvement over baseline come from?
Many mammography guidelines use age alone, and most people assume screenings such as this target “older women” — but don’t realize how blunt a tool that is. It yields an AUC of 0.583. This is way better than chance! But remember, that’s a questionable metric in the first place.
Adding other known risk factors (whether a woman has had children; her age at first birth if so, where older is riskier; family history; and hormonal birth control use) increases the AUC to 0.589. That improvement is from 0.583 to 0.589 — an increase of 0.006. Not that much better than using age alone, but still better.
However, that 0.006 (from 0.583 to 0.589) is only ~15% of the 0.04 gain (from 0.589 to 0.629). So the possible incremental gain from ML is more than 6x larger than the gain from better traditional modeling with known risk factors.
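A quick back-of-the-envelope check of that ratio, again just plugging in the paper’s reported AUCs:

```python
# Comparing the size of each incremental AUC step (values as reported in the paper).
auc_age, auc_risk_factors, auc_ml = 0.583, 0.589, 0.629

step_traditional = auc_risk_factors - auc_age  # 0.006: adding known risk factors to age alone
step_ml = auc_ml - auc_risk_factors            # 0.040: adding the claims-data ML model

print(f"{step_traditional:.3f} {step_ml:.3f} {step_ml / step_traditional:.1f}x")  # 0.006 0.040 6.7x
```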
That means there’s still an exciting possible finding here to explore, although we would like to see it (1) linked with a primary concern for causality and (2) translated into clinically meaningful terms (e.g., lives). We would also like to see both those things done for the better traditional modeling, although here we already have some good ideas about causality for every known risk factor.
Even a small AUC gain — like that from age-alone to age-plus-other-known-risk-factors — might be practically meaningful if it translates into improved patient outcomes. But that’s an empirical question.
We need to see the modeling estimates for the relevant models translated into clinical outcomes and tested in the real world. We need to see this both for the fancier traditional model that uses known risk factors (which could let high- and low-risk women self-select into and out of mammography with better-informed consent) and for the souped-up ML version.
An invitation to informed consent
Meanwhile, are there any mammography screening invitation materials that actually incorporate the known risk factors information so that women can self-select into high-risk or low-risk screening categories in deciding whether to attend the test? I haven’t seen any. I hope I’m wrong and they exist! But I suspect that most doctors never mention high versus low-risk subgroup categorization modeling when they invite patients to mammography, just as most never mention cancer risks to their patients when they prescribe hormonal birth control.
The Harding Center’s gold-standard Fact Boxes exclude the high- versus low-risk subgroup information, possibly because, once you start doing subgroup updating, risk calculations get harder, fast.
Still, patients deserve informed consent. Especially given that we know* now that these risk factors matter, women have a right to know that they should consider them when they’re invited to mammography screening. (*Or at least, we have correlational and causal reasons to suspect that they do.)
It is furthermore unclear how we would practice informed consent with the proposed ML precision screening. How would doctors deploying new algorithms to sort patients into high- and low-risk categories explain the sorting mechanism? If it’s black-box, would they say it could be wrong? Presumably patients would be free to accept or refuse the categorization, but would they also then be informed about other risk factors to consider? What about expected generalizability limits of this type of tech when it comes to minority subgroups? (This hypothetical conversation is getting long, and that’s not how most medical practice works.)
Overall, this looks, to the jaundiced eye, like another area among many where researchers operating under “publish or perish” have perverse incentives to chase the cutting edge of science. But they’re chasing it where the gains to the people they’re supposed to be serving with their work may actually turn out to be practically quite small or nonexistent. And the system of scientific publishing is rewarding them for doing it, because it’s broken and that’s how it works.
Another, more generous way of interpreting this is to say that Daysal et al. may really have made a practically substantial improvement. We just don’t know from this paper, because they would need to report their findings properly, in terms of possible practical importance, and do a better job highlighting persistent inferential uncertainty and what it means.
What does it mean?
We still have to see what actually happens
Scientists aren’t fortune-tellers. We can’t predict the future, and have to keep watching like everybody else to see what actually happens. To their credit, the authors acknowledge this:
To make claims about long-run health effects, we must estimate how observed outcomes would differ from counterfactual outcomes if women were not targeted for screening. As we do not observe these counterfactuals in the data, these estimates are subject to a causal inference challenge. For a woman whose cancer was caught through screening, we do not know how the cancer would have developed if she had not received early treatment. For a woman who was not covered by the screening program but developed a symptomatic cancer, we do not know whether early screening would have improved her health outcomes. We require some other source of variation in the data to produce credible counterfactual estimates (pp. 3–4).
So we need more data to know whether using ML to tailor who to screen is better.
One problem this highlights is that we still need to ask what we value in deciding what to compare the ML result to in the first place. Do we compare it to standard invitations to mammography, which critics widely decry as inadequately informing women about the possible risks of overdiagnosis and the lack of proven all-cause mortality benefit? Or to a (fictitious ideal) invitation that informs women more fully about how to self-select into high- and low-risk groups? Or to not doing mammography screening at all, and instead reallocating the same resources to other preventive efforts?
The answer here depends on what we care about. If it’s saving lives, we have to look at causality first and consider the whole picture of these screening programs’ effects. We have to count the bodies. It sounds so simple…
What does it mean to count the bodies?
A recent commenter (Elizabeth Fama) suggested I “write a post about why (and when) all-cause mortality is the ideal measure for these sorts of studies.” That’s a good idea for a future post. Elizabeth goes on to take us back to causality, asking:
If an intervention helps a life-or-death condition but doesn't reduce all-cause mortality, wouldn’t we have to argue that the intervention made death from another cause more likely than it would have been without the intervention? (Asked with genuine confusion.)
Maybe, maybe not.
As I wrote recently:
Reasons to suspect that mammography may actually cause net loss of life (not to mention quality of life) include serious risks of harm from overdiagnosis as detailed by numerous critics including H. Gilbert Welch et al, Susan Bewley, and Peter C. Gøtzsche. Those possible harms, highlighted in a recent DCIS trial’s preliminary report and preregistration description, include more radical interventions accidentally spreading cancer — making it more invasive — and very high chronic pain levels at surgical sites (25-68%) 4 and 9 months after breast surgery (Hwang et al, COMET Trial preregistration description).
So overdiagnosis harms like infection from surgery and accidental cancer spreading could certainly cause preventable deaths.
But we could also be seeing something like senescence play a role: People with one possibly life-threatening condition (e.g., a breast tumor that might develop into a clinically important cancer) could also be more likely to develop others in the same interval for shared underlying reasons (e.g., faltering immune function).
So maybe the small number of breast cancer deaths that mammography seems to prevent are balanced out by a roughly equal number of deaths from other causes. Not necessarily or only from iatrogenesis, but also perhaps from common causes.
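To illustrate that competing-risks intuition (and only the intuition; every number below is invented for the purpose of this sketch, not an estimate of real mammography effects), here is a toy simulation in which the same underlying frailty raises both breast-cancer risk and the risk of death from other causes:

```python
# Toy competing-risks illustration. All parameters are hypothetical, chosen only to
# show the structure of the argument; they are not estimates of real screening effects.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # simulated women, one fixed follow-up window

# Shared "frailty": the same underlying vulnerability raises both the risk of a
# clinically important breast cancer and the risk of death from other causes.
frailty = rng.gamma(shape=2.0, scale=0.5, size=n)  # mean 1, right-skewed

p_bc_death = np.clip(0.004 * frailty, 0, 1)    # hypothetical breast-cancer death risk, unscreened
p_bc_death_screened = 0.8 * p_bc_death         # assume screening averts 20% of those deaths
p_other_death = np.clip(0.05 * frailty, 0, 1)  # hypothetical risk of death from any other cause

def expected_all_cause(p_bc):
    # Probability of dying of either cause, averaged over the simulated population.
    return np.mean(1 - (1 - p_bc) * (1 - p_other_death))

unscreened = expected_all_cause(p_bc_death)
screened = expected_all_cause(p_bc_death_screened)
print(f"all-cause mortality, unscreened: {unscreened:.4%}")
print(f"all-cause mortality, screened:   {screened:.4%}")
print(f"breast-cancer deaths averted:    {np.mean(p_bc_death - p_bc_death_screened):.4%}")
print(f"all-cause difference:            {unscreened - screened:.4%}")
# The all-cause difference comes out smaller than the cause-specific benefit, because some
# of the women whose breast-cancer death is averted die of another cause in the same window
# (they share the frailty that put them at risk in the first place). And the whole difference
# is a small fraction of baseline all-cause mortality: easy to miss, and easy to offset if
# screening itself carries even modest iatrogenic risks.
```

Again, the point is structural, not quantitative: against an all-cause baseline dominated by other causes, a real cause-specific benefit can translate into an all-cause difference that is small, partially offset, and hard to detect.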
This highlights why methodological debates about metrics are not academic. They get to the heart of what works, what may backfire — and how hard it can be to tell the difference.
When we try to assess the success and failure of mass screenings for low-prevalence problems, we have to choose endpoints. This requires a lot of subjective work in selecting, analyzing, and interpreting results. There’s no getting the human observer out of this, even when we try to deal with strictly objective, measurable endpoints like lives and deaths.
That difficulty is exactly why we must insist on scientific communication using practical metrics that matter. That means caring about lives we can count, not inflated AUC gains. It means expressing that care by using endpoints with real-world meaning. And it means thinking deeply about how persistent inferential uncertainty and endpoint ambiguity may shape the costs and consequences of mass screenings.