Rethinking Risk - the Free Online Risk Calculator
What the world needs now is better risk communication -- but what does that look like?
You’re sitting in the waiting room where the doctor sent you to think it over, unsure what to do. You text a friend, saying so. You search the Internet. But you don’t feel like you get the risks you’re weighing. Your friend texts you back, sharing a free online risk calculator that helps you count bodies to get a better sense of what probabilities mean, and understand who’s most at-risk for what, instead of hearing about aggregate risks that apply to no one. It doesn’t give you an answer, but it helps you find yours.
I’ve been thinking more about this risk calculator idea from my second-to-last post. Counting the bodies is a keeper; people need to see numbers in frequency formats in order to make good statistical inferences. But we also need to think about how to represent uncertainty in outcome classifications, and uncertainty along other dimensions; otherwise, estimates look more confident than empirics warrant, and people get hurt. In this sense, counting the bodies is hopeless — we don’t have the information. It’s also pointless: values, feelings, and experiences — not numbers — drive many if not most major decisions. But that’s ok. It might still be worthwhile to offer people better information for decision-making even if they don’t use it. Or they might use it in tandem with shortcuts instead of statistical inferences. Alternately, maybe helping people rethink risk to make better decisions should focus exclusively on bolstering these heuristics to help people get a larger number of decisions less wrong. How automation fits into all this, I don’t know; but I have some idea how it could go wrong…
Count the Bodies
I learned this the Gestalt (accidental) way from Stephen Fienberg’s NAS polygraph table: If you’re going to show people numbers, a confusion matrix showing frequency counts of all* outcomes works better than probabilities at triggering correct statistical inferences. This was, unbeknownst to me at the time, an application of what was then cutting-edge decision science: People can generally intuit what numbers mean when we count bodies, and generally can’t correctly calculate the implications of percentages for risk assessment purposes.
This insight comes from “How to Improve Bayesian Reasoning Without Instruction: Frequency Formats,” by Gerd Gigerenzer and Ulrich Hoffrage, Psychological Review, 102(4) 1995, 684-704. The idea is that we evolved seeing counts (counting bodies), not doing out probabilities or percentages. So when we get mathematically identical information in these different formats, we do a much better job assessing risk with the former (historical exposure) than the latter (evolutionarily novel) format. They’re not psychologically identical, even though they’re mathematically identical; and the format makes a huge difference in usability.
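To make the format difference concrete, here is a minimal sketch (mine, not Gigerenzer and Hoffrage's) that works one screening problem both ways, using illustrative numbers of the kind the frequency-format literature leans on: a 1% base rate, 80% sensitivity, and roughly a 9.6% false-positive rate.

```python
# A minimal sketch of "mathematically identical, psychologically different":
# the same screening problem in probability format and in frequency format.
# Illustrative numbers only (assumed for this example, not from the post).

# Probability format: Bayes' rule on conditional probabilities.
prevalence = 0.01
sensitivity = 0.80
false_positive_rate = 0.096

p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(f"Probability format: P(disease | positive test) = {p_disease_given_positive:.1%}")

# Frequency format: count bodies in a population of 1,000 people.
population = 1000
sick = round(prevalence * population)                                    # 10 people
sick_and_positive = round(sensitivity * sick)                            # 8 people
healthy_and_positive = round(false_positive_rate * (population - sick))  # 95 people

share = sick_and_positive / (sick_and_positive + healthy_and_positive)
print(f"Frequency format: {sick_and_positive} of "
      f"{sick_and_positive + healthy_and_positive} people who test positive "
      f"are actually sick, about {share:.1%}")
```

Both formats land on roughly 8 in 100 positives being true positives; most people can read that off the counts, and very few get there from the conditional probabilities.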
In other words, it’s not (necessarily) that our statistical intuition is flat-out terrible, as Kahneman and Tversky had it. Gigerenzer — director emeritus of the Center for Adaptive Behavior and Cognition (ABC) at the Max Planck Institute for Human Development and director of the Harding Center for Risk Literacy in Berlin, and a leading risk communication researcher — has done a lot of work showing that at least part of the problem is, rather, that scientists kept testing people in terms of probabilities. And that’s not how we evolved to think smart, fast or slow. So there is tremendous utility in counting the bodies to help people make better-informed choices when it comes to common medical interventions like contraception, mammography and other cancer screenings, and psychiatric drugs. But it takes a lot of time to do one of these examples well.
So rethinking risk this way probably means going deep instead of broad. It would take a lot of people working together to make a good collection of risk analyses on this model, and even then the collection would probably have to be fairly limited in its subject-matter foci. So this looks like a wiki, maybe with a PubPeer-style section for post-publication discussion and proposed revision.
Counting the bodies like this to make a risk calculator has bigger problems than resource-limited scope. For one thing, this type of tabulation acts like all cases can be sorted into binary outcome bins, which is not always true. Sometimes — oftentimes? — there are uncertain outcomes, too. And they get left out of (or wrongly baked into) categorical analyses, distorting frequency and probability figures alike.
Mind the (Uncertain) Gap
A common methods criticism of published scientific literature is that it often relies on categorical analysis, turning continuous realities into binary variables. This throws away a lot of data and can distort results for no pay-off. It’s just a familiar way of doing computation that used to be hot, because numbers, woo. Evidence-based medicine pioneer Alvan Feinstein warned of “the distraction of quantitative models”; his warning remains valid. Following bad science just because it contains numbers makes bad policy, and should stop. But people are overloaded and keying on the trustworthiness heuristic of numbers = objective = good and true (I think).
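To get a feel for what dichotomizing costs, here is a minimal simulation sketch (my own toy example, not data from any of the studies discussed below): a continuous exposure with a modest real effect on an outcome, analyzed once as-is and once after a median split.

```python
# A toy simulation of why turning a continuous variable into a binary one
# throws away information. All numbers here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
exposure = rng.uniform(0, 12, n)                 # e.g., months of exposure, continuous
outcome = 0.3 * exposure + rng.normal(0, 3, n)   # modest true effect plus noise

# Analysis 1: keep the exposure continuous.
r_continuous = np.corrcoef(exposure, outcome)[0, 1]

# Analysis 2: median-split the exposure into an "ever/never"-style binary.
binary = (exposure > np.median(exposure)).astype(float)
r_binary = np.corrcoef(binary, outcome)[0, 1]

print(f"Correlation, continuous exposure: {r_continuous:.2f}")
print(f"Correlation, median-split binary: {r_binary:.2f}")
```

The dichotomized analysis recovers a weaker signal from the exact same data; layer real-world confounding and selection on top, and the distortion can run in either direction.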
For example, infant feeding studies often cite correlations between breastfeeding and outcomes like child health and educational attainment to argue “breast is best.” But existing evidence doesn’t establish that breastfeeding has a causal effect at all. And it often just doesn’t work that well, so current norms discouraging caretakers from feeding hungry babies formula appear to be doing a lot of preventable harm. One of the methodological problems in the infant feeding science backing these norms is treating breastfeeding as binary, or chopping it into time-duration categories, instead of letting it be a continuous variable like it is in real life. Another, related mistake is ignoring why women stop breastfeeding or add formula. This means omitting accidental starvation as a confound, because women often report delayed or insufficient milk as the reason they start using formula. So continued breastfeeding could proxy for breastfeeding that worked well enough that the babies whose mothers kept breastfeeding were lucky enough to avoid starvation (selection bias); but it wouldn’t make headlines to suggest that maybe protecting infants from starvation is good. It would, however, radically change infant feeding norms to recognize that and act accordingly on the precautionary principle. In sum, infant feeding science routinely produces numbers that look precise and certain, and that are used to promote current policies — but that do not establish causal effects or measure what needs to be measured, instead chopping bad data into fictional categories to keep generating widely misinterpreted results that seem to (but don’t) support policies that contribute to common and preventable harm. Infant feeding science and policy are a mess.
Babies are messy, but it’s not just feeding babies. Reality is messy. Fienberg’s polygraph table* followed the convention of the field by omitting uncertain results. Polygraph results aren’t just “pass” or “fail,” but also “idk.” Ignoring this grossly overestimates accuracy. (That was one of many very generous assumptions the NAS committee made in favor of polygraph proponents, and their analysis still concluded the National Labs shouldn’t use mass screening; another was the 80% accuracy figure, which was never intended as an actual estimate — only as a very generous assumption for hypothetical calculation purposes.)
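The arithmetic of that overestimate is simple. Here is a minimal sketch with hypothetical counts (assumed for illustration, not the NAS committee's figures), splitting results into correct, wrong, and inconclusive bins:

```python
# Hypothetical counts for 1,000 screenings; the point is the denominator.
correct = 700
wrong = 150
inconclusive = 150

accuracy_dropping_inconclusives = correct / (correct + wrong)
accuracy_counting_inconclusives = correct / (correct + wrong + inconclusive)

print(f"Accuracy with inconclusives dropped: {accuracy_dropping_inconclusives:.0%}")  # ~82%
print(f"Accuracy with inconclusives counted: {accuracy_counting_inconclusives:.0%}")  # 70%
```

Dropping the “idk” bin changes nothing about the test and everything about the headline number.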
The bigger methodological issue here is that cutting out real-life messiness screws up analyses. One man’s “All models are wrong, but some are useful” is another’s “Sensitivity and specificity are one-sided or conditional versions of classification accuracy. As such they are also discontinuous accuracy scores, and optimizing them will result in the wrong model.” George Box’s opponent here is Frank Harrell, Professor of Biostatistics at Vanderbilt University, Associate Director of the VICTR Research Methods Program, Co-Director of the Study Design Core, Trial Innovation Center at the Vanderbilt Institute for Clinical and Translational Research, and author of the cult statistics textbook Regression Modeling Strategies, referred to by many a stats consultant as their Bible.
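Here is a minimal sketch of Harrell's point using toy numbers of my own: two hypothetical models that make identical classifications at a 0.5 cutoff, and therefore post identical “accuracy,” while producing probability estimates of very different quality. A continuous proper scoring rule like the Brier score sees the difference; the thresholded accuracy score cannot.

```python
# Toy example: thresholded accuracy is a discontinuous score that discards
# information a proper scoring rule (here, the Brier score) retains.
import numpy as np

y = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # true outcomes

# Two hypothetical models that make the SAME calls at a 0.5 cutoff:
probs_poorly_calibrated = np.array([0.51, 0.52, 0.51, 0.49, 0.48, 0.49, 0.05, 0.95])
probs_well_calibrated   = np.array([0.90, 0.85, 0.80, 0.10, 0.15, 0.20, 0.45, 0.55])

for name, p in [("model A (poorly calibrated)", probs_poorly_calibrated),
                ("model B (well calibrated)  ", probs_well_calibrated)]:
    accuracy = np.mean((p >= 0.5) == y)   # discontinuous, cutoff-dependent
    brier = np.mean((p - y) ** 2)         # proper scoring rule, continuous
    print(f"{name}: accuracy = {accuracy:.2f}, Brier score = {brier:.3f}")
```

Both models score 0.75 on accuracy; the Brier score (lower is better) flags model A as far worse, which is exactly the information a hard cutoff throws away.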
You don’t have to read a statistics textbook to get why ignoring uncertainty is bad science that hurts people. At the human level, I key this into some experiences with and later critiques of rheumatology and diagnosis AI cut-offs closing sick people out of needed treatment for no good reason. For instance, when Batu et al used Adamichou et al’s lupus diagnostic aid (SLE Risk Probability Index, SLERPI) on kids, they raised the diagnostic threshold to ostensibly improve accuracy (better specificity; fewer false-positives), accepting more false-negatives (worse sensitivity) in the process. But in real life, clinicians should probably treat patients who score a 7 using this tool the same as they should treat those scoring an 8. Increasing the tool’s accuracy like this looks good on paper, but stands to hurt people in practice. Both Adamichou et al and Batu et al also used patient data that excluded uncertain diagnoses and treated lupus as a binary instead of a continuum (e.g., with Antiphospholipid Syndrome toward one pole and Undifferentiated Connective Tissue Disorder NOS, or “lupus lite,” toward the other). This inflates accuracy and excludes a lot of people who could benefit from better care. You don’t necessarily want to treat possible false-negatives any differently from how you treat true or false-positives, when it comes to offering low-risk treatments and preventive care to sick kids. Ignoring uncertainty — instead of thinking about how to give a large number of people with uncertain diagnoses better care — is especially bad in rheumatology, a field notorious for difficult diagnoses.
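For the cutoff trade-off specifically, here is a minimal sketch with hypothetical index scores (not SLERPI data): raising the threshold from 7 to 8 improves one column of the table and worsens the other, and the people who score just under the new line are the cost.

```python
# Hypothetical diagnostic index scores for illustration only.
import numpy as np

rng = np.random.default_rng(1)
scores_disease = rng.normal(9, 2, 200)   # 200 patients who have the disease
scores_healthy = rng.normal(5, 2, 200)   # 200 patients who don't

for cutoff in (7, 8):
    sensitivity = np.mean(scores_disease >= cutoff)
    specificity = np.mean(scores_healthy < cutoff)
    missed = int(np.sum(scores_disease < cutoff))
    print(f"cutoff {cutoff}: sensitivity = {sensitivity:.2f}, "
          f"specificity = {specificity:.2f}, sick patients missed = {missed}")
```

The patients scoring a 7 do not stop being sick when the threshold moves to 8; they just stop being counted.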
But omitting uncertain cases from diagnosis accuracy figures is pervasive (bad practice) in medical and other research more generally. So we have to think hard about the implications of uncertainty, not unthinkingly re-code it as one of these binaries — even or perhaps especially when the stakes are high. And doing this is not a current norm, so it takes extra cognitive and emotional work to cut against the grain of business as usual.
Doing better science to help more people means sticking to the truth, and the truth is that reality is messy. Deal with it. Don’t torture the data into looking prettier than it is. The only people that helps are scientists publishing papers with higher accuracy figures than the evidence warrants, and proponents of programs that get funding on the basis of those bogus figures.
Ignoring this one type of uncertainty (among many types of uncertainty) is not the only way accuracy gets inflated…
Counting the Bodies is Hopeless and Pointless (But That’s Okay)
When I explained the risk calculator to a tech guy, he thought it was like how he envisions augmented reality in five years: Put on your Google Glasses, and you can see the risk of a terror attack on your favorite Tel Aviv café before deciding if you’ll go out or stay in; or be reminded how much safer flying on a given plane is than driving an analogous route on vacation. Risk, risk, everywhere! Overlaid with reality in accurate, tailored, up-to-date probabilities to inform more rational decisions.
This is not the idea. First, probabilities are the wrong way to show risk information if you want people to understand it (see Gigerenzer, above); counting bodies works way better. I don’t doubt that probabilities are a neater way of making something look sciencey! And maybe they even seem more immediately comprehensible. Perhaps people also prefer them if you focus group them and ask? I’m sure a user interface team somewhere is winning this argument right now. It will just lead to bad decisions, is all. It’s like the Rolling Stones said: You can’t always get what you want, unless a corporation focus groups to see what that is, and then you don’t necessarily get what you need, because tech and science are actually different industries, and the hands don’t always talk.
Second, we don’t have the information to do this, and that’s not about to change. This is a consensus view in the study of information (call it informatics, broadly) that I share with, for example, Ralph Hertwig and Ido Erev. In “The description-experience gap in risky choice,” Trends in Cognitive Sciences, 13(12), 2009, they write that “this fully described world is both unattainable — because of lack of information — and a caricature…” (p. 522).
Hertwig, director of the Center for Adaptive Rationality at the Max Planck Institute for Human Development in Berlin, and another leading risk communication researcher, suggests the point is moot, since this isn’t how people view risk, anyway. They do mostly experiential estimates — how common is this killer in my circles? This is also known as using the availability heuristic. These risk assessments can be pretty accurate.
Except, that is, when it comes to black swans — statistically rare but high-impact events. From dust we come, to dust we shall regress to the forgotten mean. History rhymes, but we missed the first stanza.
Sometimes, people also have preferences that don’t flow from body counts, but rather from qualitative things, like preferring a “better death” to a “worse death.” I was very aware of this doing out the breast cancer/contraception example in that next-to-last post. In the acres of junkyard, there were several versions before I caught the denominator error in the postpartum breast cancer medical literature that was swinging the net death risk against the pill compared to withdrawal. So using withdrawal looked like it gave mothers better overall survival odds than using the pill. Seemed like big news? Let’s say around 1/1000 mothers seemed to be dying preventable deaths from metastatic breast cancer due to hormonal contraception. But that small net death risk came in exchange for preventing a lot of births — 1,800 in a sample of 10,000. Again, it was based on an erroneously inflated postpartum breast cancer incidence, and the death risk seemed to net out in my final calculations.
The point is that these calculations explicitly raised the issue of how people have a right to make their own, varying decisions based on qualitative ideas and values — to live different sorts of lives and risk different sorts of deaths. These are equally existential choices. But they don’t show up in net death risk numbers. Values drive the science bus.
These issues are less visible when the death risks net out. But they’re still packed in there, in the bits of risk making up different possible deaths resulting from different contraceptive choices. Maybe no metastatic breast cancer death risk at all is acceptable to some mothers, because that’s a different kind of death from relatively sudden, private death from pregnancy/birth. Or maybe overall survival odds are what you care about on behalf of your kids, so you do look to the net death risk number. These are personal choices. They’re about values, even when we try to strip things down to life or death in black and white.
Point being, even the best-intentioned risk tables can seem to make invisible the interpretive choices they contain and the values they embody. And that’s ok because, again, mostly, people don’t really make maximally rational decisions based on the best comparative risk tables they can find, anyway. So if the point is to improve decision-making with better risk communication based on what we know in science — but could stand to translate more broadly to help people as a matter of science communication as public service — could there be a better way of doing that? Maybe one that incorporates what we know about heuristics?
Keep It Simple
Heuristics are shortcuts people can and often do use to make better decisions given imperfect information. We know so much more about heuristics than I know! And I’m into this sort of thing. So we know much more about heuristics than most people know, and that’s a gap that technology could help close to improve decisions in a much broader way than a wiki-style body-counting platform. This is the big “broad versus deep” fork in the risk calculator idea.
Helping people with Pomodoro-like simplicity to enact what we know works for better risk management? Maybe that could work.
But it goes against my impulse to hunker down calculating and helping other people do that, too. I want to see the bodies! I want to know what the right answers are. And, sometimes, that’s been my weakness. I want to count the trees, but we need to see the forest. I want the quantitative net existential risk bottom line, but it’s going to be wrong and may also not be as useful. I want the rabbit hole, but people need a quick summary in the light.
Then again, maybe this fork is not even a fork. Maybe better risk communication should be doing both: more body-counting and better heuristic-priming to help people make better decisions.
Either way — broad or deep — the promise of tech here is not just sharing information widely, or helping people collaborate from far away to build something cool together. It’s automation…
Automate Wisely
My favorite risk calculator automation fantasy lets you put in your specs (like age and sex in Isabel's Symptom Checker), put in your options (like mammography screening or not in Harding's example, or hormonal contraception versus withdrawal in mine), tailor it some more, and get out the baseline of what you need to know to make an informed decision. Presto change-o, now we have a modicum of that informed consent we’ve been hearing so much about! But doing this across lots of subject-area terrain with automation is probably impossible.
One reason is that it takes real human beings reading the scientific literature well to know what it says. Because much of it is terrible. This easily generates a classic “Garbage In-Garbage Out” problem where bad science is considered on par with good science, and what comes out reflects that.
This is what happened with Consensus, an app another friend sent me a while back that was meant to use AI to make science more accessible, but that missed the causal revolution and the need to dig into methods (aka the “science crisis”) more broadly. Maybe I’m biased because I like reading PubMed; but there’s no substitute for reading PubMed. If you want to know what it says, you have to actually do the work and make sense of a vast swath of information yourself. The reason is that you can’t trust what the research literature says it says, or what institutes with policy positions say it says, and synthesize that as truth. That’s a telephone game. And there’s too much error for it to shake out.
The same is true, by the way, of qualitative source material like the Human Relations Area Files. There’s no shortcut to spending a year of your short life reading about your topic of interest in this lovely database. AI can’t do this for you. You can’t just take what people say at face value without considering the source and context. There is no ready-made quick-sum tailored to your needs. That’s just another telephone game.
Across so-called hard and soft sciences, quantitative and qualitative materials, automation is not a “get out of thinking” free card. Still, there are things computers can do easily that are really hard for people, and vice-versa. And we often don’t know what’s possible until we try. So it’s fun to think about how maybe automation could revolutionize the way we think about risk, after all.
I’m envisioning a love-child of Pearl, Thomas Nagel, and the three wise men of rethinking risk in eastern Germany — Gigerenzer, Hertwig, and McElreath. Something that combines interest in calculating and communicating better science using better methods, on one hand, with emphasis on shortcuts, experiential wisdom, and broad applications, on the other. Giving people the quantitative, existential bottom line, with all its limits and caveats, flowing from focusing on causes, at-risk groups, and comparative net risk estimates instead of aggregate, single-option effect estimates. But letting go of context enough for it to work beyond medical contexts, like in everyday life (e.g., with travel options or pandemic behavior choices), law, and business. Cueing helpful heuristics as well as marshaling evidence and translating risk numbers into more readable formats. One of Gigerenzer’s points, at least as I take it, is that we don’t have to choose between frequency counts and heuristics to support better decision-making; they’re both about priming cognitive styles that get better results.
Still, an alternative is stripping the concept down to plugging in heuristics and Bayesian thinking for people as they walk down a decision path in widely varying contexts, to help them think about (hmm) thinking about risk in a way that might help them make better decisions for themselves. Dropping numbers and going for cognitive style made simple. Making a calculator without a topic that doesn't give you an answer, but that you can still tailor by circumstance to get better help? Sounds like a fancy horoscope of a sort. But when the stars help us see, maybe they help us make better decisions for ourselves, in spite of ourselves.