Limited Limits
Models seem to offer scientists escape from the bias and ambiguity of language and perception, yet formal methods alone are insufficient for inference
When I was in grad school, I got a speeding ticket driving from Charlottesville to Northern Virginia to conduct the first interview of what became my dissertation research. Because I’m a very risk-averse person, know how dangerous cars are, and was driving the speed limit with drivers flying around me honking and yelling, I was confident that the officer had erred in issuing a speeding ticket. And so I showed up to traffic court to clear it up. There were a number of things that may have resulted in the measurement error I thought had occurred, though driving back to say so was not my idea of fun.
The judge wasn’t too tickled with my appearance, either. His face twisted in scorn as I explained some possible sources of error — so much so that I stopped a few times to ask if I should continue. He told me to go ahead. When I had finished, he explained to me that a lot of law and grad students came into his court thinking they knew the problems in the field, and they didn’t have a clue what they were talking about. But that because the police officer in this case couldn’t prove that he had properly maintained his equipment, he would throw out the ticket.
In reality, what probably happened was that the officer thought he had tagged my car with the radar, but had tagged another — one of the ones flying around me. But on that point it would have been his word against mine, and his would have meant more to the judge. Was it possible that the officer knew that could be the case too, but had a motive to not say so (like just not losing face)? Could he have had a ticket quota to meet, and seen a petite white woman as the safest person to stop to meet it? Or could he have thought I was doing something dangerous by driving the speed limit with cars flying around me (which is, admittedly, possibly dangerous) — and meant to tell me so? Some people say there are limits to limits; you’re supposed to know when, in traffic and in life, to not actually follow the rules. But even I knew that these didn’t seem like questions one could ask as a matter of social reality. I wasn’t in the question-asking role.
In any case, what actually happened was that my car window’s motor had broken months earlier; rolling it down was impossible. So I opened my door when the officer came to talk to me. And in response, the officer unholstered his gun. In other countries, drawing a service pistol is a lot of paperwork; in the U.S., it’s Tuesday. The poisonous cultural soup of cars, guns, and policing in America not having seemed like appropriate courtroom conversation, I was pretty happy to wind up driving home instead of being thrown in jail for just as good a reason as I was stopped and threatened with a deadly weapon in the first place.
Yet, in a way, that scornful old Southern judge was right. Science is a social institution done by human beings, and the tools it offers can help us overcome our limits. But they have their own. And using them, we still have our own. Limits are, well, limited. This doesn’t mean it’s no good trying to do science. But it does mean we should lose our secular religion around it, and try to take two looks at evidence: What we think it says, and what that might say about us. Or, if we must do division of labor: What most people say science says, and how bias, error, and power shape those stories.
This post combines some old and new examples to sketch the outline of what it might mean to do the kind of critical-reflective science that flags the usual errors without ticking off the judge. These examples — a pandemic modeler’s hubris about what models can do, a psychotherapist’s emphasis on grappling with real-world complexity in improving people’s lives, an economist’s law about the distortions metrics produce, and an anthropologist’s attempted break from critical science — all converge on leading statistics reformer Sander Greenland’s application of philosopher of science Paul Feyerabend’s anarchic critique of methods, especially ones that entail automation or make authoritative claims. Greenland’s point in “For and Against Methodologies: Some Perspectives on Recent Causal and Statistical Inference Debates” (Eur J Epidemiol (2017) 32:3-20) is that all methods are only tools to be used when appropriate for a job, with recognition of their limits. Many scientists seem to think and write as if formal methods alone enable inference; they do not.
In a way, this is all straight from my first graduate seminar on research methods (thanks, Professor Freedman). All social scientists (I hope) are taught to triangulate methods to balance the strengths and weaknesses of some with the differing strengths and weaknesses of others, including listening to what ordinary people have to say instead of only looking at numbers. This post just extends that argument through different fields and subject matters, and highlights Greenland’s inclusion of some relatively new and high-level techniques (i.e., causal diagramming and sensitivity/bias analyses) in the familiar story, along with his philosophical mooring.
Hubris
“Our verbal model can only take us so far,” writes Kit Yates in his recent book How to Expect the Unexpected: The Science of Making Predictions and the Art of Knowing When Not To, the blurred background for my last post. “If we want to know in advance how to avoid these sorts of backfirings” — Yates continues, discussing how U.S. abstinence-only education probably accidentally increased teen births — “we need quantitative, dispassionate mathematical models to tell us whether we’re going to throw a strike or a boomerang so that we can make informed decisions, in the full knowledge of how a situation will likely play out” (p. 360). In other words, models are like maps we can just follow to get where we want to go. The map is the territory.
This is exactly wrong, and reflects a typical human error of hubris that scientists also exhibit. “The map is not the territory,” according to Polish-American philosopher and engineer Alfred Korzybski (h/t Patrick Burden). Thinking otherwise reflects the common bias of statistical reification. Leading statistics reformer and UCLA epidemiology/statistics professor emeritus Sander Greenland defines this as “treating hypothetical data distributions and statistical models as if they reflect known physical laws rather than speculative assumptions for thought experiments.” As I’ve written previously, this bias often leads to policy-level misreadings of hypothetical projected model outcomes as real-world practical outcomes and values, like misinterpreting estimated false positives in mass screenings for rare problems as net liberty costs to balance against security gains — when no such trade is on the table, and such screenings massively damage society instead. In Yates’ account, this (unselfconscious) reification is just one part of a larger, inaccurate and political narrative about what Greenland calls “romantic heroic-fantasy science.” (See my last post for more book quotes drawing out this critique on that terrain.)
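If the statistical reification point feels abstract, here is a minimal sketch of the base-rate arithmetic behind projected false positives in mass screening for a rare problem. Every number in it is hypothetical (mine, not Greenland’s and not from any study); the output is a projection under assumed inputs, which is exactly the kind of thing that gets reified into a real-world cost or benefit.

```python
# A toy base-rate calculation for mass screening of a rare problem.
# All inputs are hypothetical, chosen only to illustrate the arithmetic.

population = 10_000_000   # people screened (hypothetical)
prevalence = 1e-4         # 1 in 10,000 have the rare problem (hypothetical)
sensitivity = 0.90        # P(test positive | problem)      (hypothetical)
specificity = 0.99        # P(test negative | no problem)   (hypothetical)

true_cases = population * prevalence
true_positives = true_cases * sensitivity
false_positives = (population - true_cases) * (1 - specificity)
ppv = true_positives / (true_positives + false_positives)

print(f"true positives:  {true_positives:,.0f}")
print(f"false positives: {false_positives:,.0f}")
print(f"P(problem | positive test): {ppv:.1%}")

# With these made-up inputs, false positives outnumber true positives by
# roughly 100 to 1. That ratio is an artifact of the assumed prevalence,
# sensitivity, and specificity -- a thought-experiment output, not a
# measured trade of liberty for security.
```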
Complexity
Part of the reason it’s hubristic to assert the map is the territory is that reality is more complex than representation; that’s definitional. Maybe this is also hubris (I’m only human); but I often hear it said that nothing is more complex than the human mind, and I could believe it.
This keys into a well-known example of science criticism (i.e., critiques of the way the scientific method is used, as often generating biased and/or inaccurate results): Evidence-based medicine in care for the soul. Existential and group psychotherapy expert, emeritus Stanford professor of psychiatry, and author Irvin D. Yalom writes in The Gift of Therapy (2002; p. 225-6):
The concept of EVT (empirically validated therapy) has had enormous recent impact — so far, all negative — on the field of psychotherapy. Only therapies that have been empirically validated — in actuality, this means brief cognitive-behavioral therapy (CBT) — are authorized by many managed-care providers… Senior clinicians see an apparent avalanche of scientific evidence ‘proving’ that their own approach is less effective than that offered by junior (and inexpensive) therapists delivering manualized CBT in astoundingly brief periods of time. In their guts they know this is wrong, they suspect the presence of smoke and mirrors, but have no evidentially based reply…
Yalom then cites a Westen and Morrison review and meta-analysis that began swinging the pendulum back, chipping away at CBT’s claimed evidence-based superiority among psychotherapeutic methods by pointing out common randomized controlled trial problems across medicine, like selection for otherwise healthy patients with one clearly definable problem (which also tends to select for acute rather than chronic cases, which also tends to select against complexity). Such concerns about generalizability travel far across science. Quicker, cheaper, cookie-cutter interventions for simpler cases are easier to measure, so scientists measure them; “publish or perish” and all. That doesn’t make them more effective than longer, more expensive, tailored interventions for more complex cases.
We also usually don’t know as much as we should about longer-term outcomes in general. Yalom concludes:
gains are not maintained and the percentage of patients who remain improved is surprisingly low. There is no evidence that therapist adherence to manuals positively correlates to improvement — in fact, there is evidence to the contrary… Naturalistic research on EVT clinical practice reveals that brief therapy is not so brief: clinicians using brief EVTs see patients for far more hours than is cited in reported research. Research indicates (to no one’s surprise) that acute distress may be alleviated quickly but chronic distress requires far longer therapy, and characterological change the longest therapy course of all (p. 227).
This debate — call it “follow the science” versus “it’s not that simple” — has continued in psychology for 20+ years. The point is not that Yalom is right and proponents of CBT/evidence-based psychotherapies are wrong. These are huge topics that are situated within numerous other debates. For instance, many methodological criticisms of evidence-based psychotherapies apply also to medication therapies for depression. See Bob Whitaker’s criticisms of the STAR-D medication trial researchers as reporting fraudulent results — 76% reported remission with drugs, versus 3% in a reanalysis — and taking money from pharmaceutical companies. You can’t mention Whitaker without mentioning leading medical methodologist and whistleblower Peter Gøtzsche’s work on iatrogenesis including increased risk of suicide associated with antidepressants. And you can’t mention Gøtzsche’s work on this without coming full-circle back to his meta-analysis on CBT reducing suicide attempts (co-authored with Pernille Gøtzsche).
This is supposed to be the best of the best of evidence-based psychiatry. It’s a hot mess of an article. Here are three quick critiques.
Critique 1: Gøtzsche and Gøtzsche set their terms clearly at the outset as body count: “Suicide prevention is what matters the most when health professionals see patients with mental disorders.” Then, they fail to recognize that the evidence from the trials they review does not establish that CBT benefits patients on these terms. Suicide is too rare for this evidence to establish the presence or absence of that benefit; there are only seven suicides to analyze, and their 3-4 spread (in favor of CBT) doesn’t establish an effect. Yet Gøtzsche and Gøtzsche go on to say “It is therefore now clear that antidepressants increase the suicide risk at all ages while cognitive behavioural therapy decreases the suicide risk substantially.” This gives CBT more suicide prevention credit than it has earned from the trial evidence they reviewed.
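To see how little seven events can tell us, here is a minimal sketch that drops the 3-versus-4 split into hypothetical pooled arm sizes (the arm sizes are my placeholders, not numbers from the trials G&G reviewed):

```python
# Why 7 total suicides cannot establish a treatment effect: a toy 2x2
# analysis. Only the 3-vs-4 split comes from the text above; the pooled
# arm sizes are hypothetical placeholders.
from scipy.stats import fisher_exact

n_cbt, n_control = 1000, 1000          # hypothetical pooled arm sizes
suicides_cbt, suicides_control = 3, 4  # the 3-4 spread in favor of CBT

table = [
    [suicides_cbt, n_cbt - suicides_cbt],
    [suicides_control, n_control - suicides_control],
]
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio ~ {odds_ratio:.2f}, two-sided p ~ {p_value:.2f}")

# With counts this sparse the p-value is near 1, and an exact interval
# around the effect would stretch from large benefit to large harm: the
# completed-suicide data are compatible with almost any hypothesis.
```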
Critique 2: G&G’s point is that the evidence suggests CBT reduces suicide risk. But “suicide risk” is an ambiguous term. If by suicide risk, one means the risk of suicide (measured in bodies), CBT does not offer established benefit according to the evidence they review. There’s not enough data to say, because it’s such a rare event. Alternately, if by suicide risk, one means the risk of future possible suicide as predicted by factors like reported attempted suicide, then there’s enough evidence here to suggest that CBT may reduce this risk substantially (95% CI 0.30-0.73). But Gøtzsche and Gøtzsche themselves raise questions about the credibility of these sorts of measures. Completed and attempted suicides are different events with different measurement problems.
Critique 3: One logical implication of G&G’s argument is that comparing CBT with “treatment as usual” is problematic insofar as it may compare patients on less psychiatric medication with patients on more psychiatric medication. A quick search for “treatment as usual” and scan of their descriptions of included studies suggests that this is likely the case. But Gøtzsche’s work in this realm is largely about how such medication substantially increases risks of suicide and violence. So this imbalance would, according to his prior, spike the results in favor of CBT — without implying that CBT necessarily exerts any harm-prevention benefit at all. CBT may thus function as a placebo in some of the underlying evidence here; randomization to the CBT group may have prevented iatrogenic harm from medication in the control group. (For all we know, a substantial subset of people who attempted suicide in “treatment as usual” groups actually did so with their psychiatric medication.) One would need to account for medication exposures in treatment and control groups to reckon with these possibilities — something this analysis does not do (and that may not be possible for reasons of data access/privacy). The findings are thus insufficiently supported by the evidence.
Gøtzsche is one of the best medical methodologists in the world. But everyone is human, also when we do science. His bias got the better of him here.
At a larger level, my concern with focusing on the realm of randomized trial evidence and (apparently) cookie-cutter treatments is that there is no established suicide prevention benefit from such treatments — the best available evidence establishes zero certain net gain. Meanwhile, critical psychiatric platforms like Mad In America highlight widespread iatrogenesis from many facets of the mental health business, not just psychiatric drugs (substantial possible net harm). And Yalom’s critiques of following the evidence-based medicine model in this realm seem to remain largely unanswered.
For example, Yalom might point again here to the issue of possible subgroup effects. E.g., maybe CBT reduces suicide (attempted and/or completed) in people with acute distress, but not those with chronic distress. Maybe the VA (the U.S. Department of Veterans Affairs) should care about this because, say, completed suicides disproportionately involve violent methods like guns, veterans disproportionately have firearms access, the poverty draft means veterans have disproportionately lower socio-economic status, and poverty sometimes tracks with other forms of disadvantage that correlate with chronic distress that correlates with early adverse experiences. Brief, manualized CBT might then fail disproportionately for a substantial subgroup of vets, and deploying it as gold-standard for vets as a whole might produce larger body counts than providing access to manual-less psychotherapy with highly trained therapists... Or not. This problem generalizes beyond vets; you just have to mention vets to get Americans to support healthcare.
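A toy arithmetic sketch of how that kind of subgroup failure can hide inside an aggregate result, with every number invented purely for illustration:

```python
# Invented numbers: suicide-attempt risks per 1,000 patients in two
# hypothetical subgroups, with opposite responses to brief manualized CBT.
acute   = {"n": 700, "risk_cbt": 10, "risk_usual": 20}  # CBT helps here
chronic = {"n": 300, "risk_cbt": 35, "risk_usual": 25}  # CBT fails here

def pooled_risk(arm):
    """Overall risk per 1,000 across both subgroups for one arm."""
    total_n = acute["n"] + chronic["n"]
    events = acute["n"] * acute[arm] / 1000 + chronic["n"] * chronic[arm] / 1000
    return 1000 * events / total_n

print(f"pooled risk, CBT:                {pooled_risk('risk_cbt'):.1f} per 1,000")
print(f"pooled risk, treatment as usual: {pooled_risk('risk_usual'):.1f} per 1,000")

# The pooled comparison favors CBT (17.5 vs 21.5 per 1,000) even though the
# chronic-distress subgroup does worse on it -- which is exactly the
# subgroup the worry above is about.
```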
Yalom may be right; Yalom may be wrong. We don’t have a way of knowing, because we can’t (practically or ethically) do randomized controlled experiments on the background conditions that are associated with chronic distress — things like having experienced abuse, neglect, and abandonment as a child. There is a methodological critique to be written of the mental health literature on this problem, including but not exclusive to trauma and trauma therapies, starting with drawing causal diagrams and coming to (I think) the inevitable conclusion: We don’t have any way of accounting empirically for selection bias problems that may generate victim-blaming and other badness when structural vulnerabilities contribute causally to trauma, traumatic stress, more trauma, more traumatic stress, etc.
So we could offer more caring care than telling distressed people to focus on reliving their worst memories (which most trauma-focused therapies do); and there’s a scientific case for erring on the kinder side of the great unknown that is mental health. That our cultural norms created professional norms that do the opposite is a reflection of our culture. It is perhaps not coincidental that there has been a worsening epidemic of veteran suicide while the VA promotes CBT and exposure therapies as gold-standard, evidence-based treatments for PTSD among other things. The point is not that these sorts of therapies don’t work for anyone — or that we can’t identify some subgroups that they may work better or worse for than others.
Rather, the point is that what looks good on paper when we “do science” doesn’t always look as good in reality, when we talk to ordinary people and listen to their complex experiences with the same openness that we try to bring to making sense of empirical reality as conveyed in quantitative data. Reality is messy. “The map is not the territory.” In Greenland’s words:
the complexity of actual context prohibits anything approaching complete modeling, the models actually used are never entirely coherent with that context, and formal analyses can only serve as thought experiments within informal guidelines… A cautious scientist will thus reserve judgment and treat no methodology as correct or absolute, but will instead examine data from multiple perspectives, taking statistical methods for what they are: semi-automated algorithms which are highly error-prone without extremely sophisticated human input from both methodologists and content-experts (p. 16).
The complex reality/simplified model disconnect costs lives. Maybe we should be alert to the disconnect between models and complex realities when we’re dealing with human beings, since it’s predictable and consequential. Maybe the old engineering saying “When brute force fails, use more force,” doesn’t apply to people.
This sounds so obvious as to be stupid. So why does it have to be said?
Perverse Incentives
Because (at best) we’re trying so hard to get things right that metrics are in, discretionary power is out, and when measures become targets, that tends to create distortions wherein we lose sight of the real goal — and we’re too busy and full of ourselves to notice. This is British economist Charles Goodhart’s law. There are many riffs on the general principle, including evidence-based medicine pioneer Alvan Feinstein’s warning about the distraction of quantitative models and other problems in the “evidence” part of evidence-based medicine.
My favorite example is (still) exclusive breastfeeding promotion. Infant feeding research is plagued by selection bias problems. Positive health selection effects (better maternal health → better lactation; more attempted and successful breastfeeding) as well as negative ones (for both moms and babies, worse health → more breastfeeding problems; less attempted and successful breastfeeding) could explain widely touted associations between breastfeeding and better outcomes. There is no evidence establishing beneficial causal effects from exclusive breastfeeding.
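Here is a minimal simulation of that selection story: one latent health variable drives both exclusive breastfeeding and infant outcomes, with zero causal effect of feeding on outcome built in (all numbers invented for illustration).

```python
# Selection bias demo: association without causation.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

health = rng.normal(size=n)                # latent maternal/infant health
p_ebf = 1 / (1 + np.exp(-(health - 0.5)))  # healthier -> more exclusive breastfeeding
ebf = rng.random(n) < p_ebf
outcome = health + rng.normal(size=n)      # outcome depends ONLY on health

print(f"mean outcome, exclusively breastfed: {outcome[ebf].mean():+.3f}")
print(f"mean outcome, not:                   {outcome[~ebf].mean():+.3f}")

# The breastfed group looks healthier even though, by construction, feeding
# has no effect: the gap is pure selection. Real data might also contain a
# causal effect; the point is that this comparison cannot tell the two apart.
```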
Meanwhile, exclusive breastfeeding does common and preventable harm. (There are safer ways to breastfeed; big topic.) Healthcare providers and policy-makers then obsess about breastfeeding and especially exclusive breastfeeding rates as health metrics — not realizing (one hopes) that the metrics miss the point, and they are actively hurting at least a substantial minority of people by doing this — for zero proven benefit.
This is not sui generis; this is health and science policy. There are so many studies and policies that target a specific subgroup with a specific diagnosis (e.g., pregnant women with gestational diabetes, or people with depression) with an intervention (e.g., SMILES’ excellent dietary one) that tries to move their outcomes (metrics), because that’s how researchers get funding and publications. Who would pay to just possibly increase everybody’s health instead of probably measurably improving outcomes in a group diagnosed with illness?
One possible answer: Society should pay for this; it would be rational in cost-benefit terms. But maybe people don’t want to be told what to do with their lifestyle choices, even though that’s where the big preventive medicine payoffs are; diagnosing and treating illness is medicine, while giving personal advice is personal. So doctors working on a client-service model don’t do that kind of medicine as much. They’re not trained to anyway, since it doesn’t pay. It’s care work — educating people in how to cook affordable, nutritious foods within traditional dietary paradigms, and such; our civilization disdains care work and then wonders why people are sad, lonely, and sick. But then we can measure their sadness and sickness, and researchers can get money to test protocols to ameliorate them. Take two dumpster fires and call me in the morning.
The point here is that error and bias can get baked into (let’s assume) well-intentioned systems that then try to be logical, scientific, and neutral by maximizing misconceived metrics. Then perverse incentives discourage the humility of worrying that this might be the case, including by recognizing the complexity that one-size-fits-all approaches deny. In the medical system, this often looks like funding for more testing and drug research, where less testing and more diet and lifestyle education and support might do the job better. Even pioneering, well-designed research like SMILES tends to look at specific metrics in a particular subgroup in order to get funding and publications. Scientists respond to perverse incentives to shape their science to the ecosystem that feeds them; of course they do. That’s a survival behavior, not a moral failing. But society pays for this when it means, for instance, that we test all pregnant women for gestational diabetes, generating huge numbers of false positives who then get subjected to invasive and stressful follow-up testing — instead of offering society-wide SMILES to help everyone improve their dietary health, possibly preventing and treating a lot of cases of metabolic problems like gestational diabetes along the way.
So metrics generate perverse incentives which generate distortions in science and science policy, and critical science fights those distortions by at least articulating them in evidence-based terms. But critical science tells a hopeless story of unending human fallibility and non-neutrality. We can take that point and still ask: Where’s the heart?
Humanism
That’s what I understand anthropologist Anna Tsing to be asking of critical science in her essay about the art of noticing. As University of Illinois Anthropology Professor Kate Clancy describes it, that art allows “us to see power, to follow multiple threads without allowing one to dominate. It is an essential scientific practice” (Period: The Real Story of Menstruation, p. 112).
But what does Tsing’s “art of noticing” bring to the idea of adding a critical-reflective scientific lens to normal synthetic-directive science and science communication? (Not a rhetorical question.) If the point is that scientists need to practice the art of noticing, because all methods are tools with limitations, all models are imperfect representations of reality, and the point is to find what works (better) in each case — that is the point of critical science. If, instead, the point is that everyone has a perspective in practicing this art (agreed) — and ours should be somehow more humanistic, or otherwise take on a particular set of values Tsing has in mind (which she might not want to call humanistic, since it deals with collaborative coexistence with planetary ecosystems in man-made crisis; but I would call humanistic because it’s about orienting to a conception of the good that cares about people) — then let’s talk about how, why, and what that value-laden criticism of critical science might look like. What are the costs and benefits of coming from an explicitly value-laden perspective rather than a purely anarchic one?
Or might Tsing or Clancy (or indeed Feyerabend) want to argue that anarchic, critical-reflective science itself is already value-laden? It is certainly an anti-authoritarian stance, though it can be deployed against any authority and is thus not innately partisan or allied with a particular set of values. Then again, neither are traditional scientific methods; they just get used mostly to support the status quo, because that’s the way the cultural currents tend to carry people’s hearts, minds, and pocketbooks, especially with the selection bias of platform. It’s not the methods per se, but how we use them. Normal science uses them normally. Critical-reflective science asks the orthogonal question of where power is in that use. It’s not a scientific, but a practical difference. And asking that question doesn’t make an analyst good or evil. It makes the analysis better, and an analysis can be used for many ends. One could envision a Russian troll factory dedicated to the study and communication of critical-reflective science — on which to base information ops designed to harm adversarial power interests, like decreasing vaccine uptake during a pandemic.
So perhaps science needs humanism like humanity needs a thriving global ecosystem, and that’s another lens that works on a different plane than the synthetic-directive and critical-reflective components. Or perhaps critical-reflective science turns out to be the best humanism we humans can do, since it’s about having humility around our own abilities to act on ostensibly virtuous intent; hold the virtue-signaling, check the receipts. A humanism that emphasizes good intent might not really add anything, either to normal science that proclaims it (in the face of widespread error and bias, and possibly fraud and corruption) or to critical science that shows power shaping the product. You can talk about honesty and goodness all you want, but you’re still one of us stupid humans. (I’m writing this as a way of thinking, not convinced yet either way.)
It doesn’t strengthen the case for a humanistic critique of critical science that Tsing’s essay itself appears to have a critical science problem (it put out possible misinformation as established fact). That doesn’t mean we can’t entertain humanist criticism of critical science; there’s thinking to be done here — and doubtless some has been done that I haven’t heard of yet. But it does underscore how much work remains before science performs this critical function at a level that is more reflective than directive — a point which sits in tension with the directive impulse of humanist criticism. You can have value-laden criticism of science and society, and you can have humility about science as a human enterprise; but the former still needs the latter to check possible error and bias in a way that I’m not sure the latter needs the former to make science better on its own terms. Maybe society needs it. Maybe those two things aren’t as separate as I’m making them sound.
The Limits of Formal Methods
This post is about the false promise of formal methods alone to enable inference. It draws heavily from “For and Against Methodologies: Some Perspectives on Recent Causal and Statistical Inference Debates” (Sander Greenland, Eur J Epidemiol (2017) 32:3-20). The story structure I’m reusing here is a little like “The Sorcerer’s Apprentice”: Initial guide in the realm of alchemy/shoe-making/whatever warns young apprentice of hubris…
Young apprentice grows up to make fine mercury poisoning/shoes/whatever. Eminent sage of alchemy/shoe-making/whatever repeats the same warning — but this time specifically about the latest methods of making better mercury poisoning/shoes/whatever. Many people are somehow surprised by this, even though this was part of the standard introductory training. Then, the initiated re-learn the lesson of humility, because it came from the eminent sage. Ironically, this story illustrates the very social and political reproduction of power that the sage in this case is warning about. No exit!
Ok, the life-course chiasmus and power parts aren’t part of “The Sorcerer’s Apprentice,” but they should be. Because this is not some tangential meta-point about the scope conditions of who gets to talk credibly about the limits of formal methods. It is the essence of the limits of formal methods. Maybe they offer an exit from some of the hackability of language. Maybe they offer an exit from some of the other limits of our perspectives and abilities. But they’re still man-made tools that we imbue through using them with that hackability and those limits. Steroids can make you a better weight-lifter, but they can’t make you a god. Using causal modeling can help you do better science, but you’ll still be a dumb human.
Back on the article level, in this case, the eminent sage is Greenland; the whatever is science; and the latest methods are causal diagramming and sensitivity/bias analyses. His point (as I take it) is that both are tools. To wit:
causal modeling, like other statistical approaches, is in the end an exercise in hypothetical reasoning to aid our actual inference process, and should not be identified with the process itself (except perhaps in narrow artificial-intelligence applications). Instead, any inferential statistic (such as an effect estimate derived from a causal model) should be viewed as the result of only one scenario in a potential sensitivity analysis, and thus can be (and usually is) misleading without reference to results derived under other scenarios (p. 8).
In other words, don’t be hubristic; reality is complex, models simplify it, and that’s an irreducible tension. But do use the best available methods to do better science. That usually means causal modeling, because (as Greenland continues):
potential-outcome models can express most of the causal concepts and bias concerns I have encountered in practice. Whether it is always helpful to express problems in such formal terms involves considering the trade-off between the precision benefit and the attention to detail bias modeling requires, along with the risk of over-confidence that the mathematics may generate from an overly narrow bias model (Greenland p. 8, citing Poole and Greenland 1997).
It must be said that it boosts one’s credibility in criticizing, contextualizing, and re-situating appropriate use of cutting-edge methods, when one helped invent them. And that, to backtrack and issue a correction already, the Sorcerer’s Apprentice is not really about all human hubris getting the young assistant in a jam; it’s about someone who doesn’t know how to properly use the magic he conjures, making a mess. Experience and competence don’t hurt, but nor do they make any formal method alone — even causal modeling — sufficient for scientific inference (Greenland p. 9, citing Broadbent et al. 2016). Science isn’t magic. Even DAGs:
As with other causal models, inferences from causal diagrams come with validity guarantees only when the diagram represents an experiment (perhaps natural) in which the input variables (root nodes) were independently randomized, and subsequent selection into the analyzed data set (not just the study) was also independently randomized — as in a perfect if complex trial with perfect subsampling. Provided a selection variable S appears in a conditioned node in the diagram, this feature makes a diagram excellent for illustrating what may go wrong with various study designs and adjustment procedures. But, as with other devices, any doubts about the diagram’s many null assumptions make it much less reliable for telling us what we can safely infer about the targeted reality. Especially, without actual randomization of a variable, the complete absence of an effect on it from a potential direct cause is usually unsupportable with data, and rarely credible for complex possible direct causes with many avenues for effects (e.g., age, race, and sex).
In other words, draw a DAG, include the selection variable, and use it to check for problems in studies and analyses (e.g., collider bias). But don’t use this tool to argue a particular set of evidence suggests an absence of racial bias or something like that, because reality is too messy to establish that sort of thing.
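For the curious, here is a minimal simulation of the selection-node case, with made-up variables: X and Y are independent by construction, and analyzing only the rows selected by their common effect S manufactures an association out of nothing.

```python
# Collider / selection bias demo for the DAG X -> S <- Y, conditioning on S.
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

x = rng.normal(size=n)                         # exposure: no arrow to y
y = rng.normal(size=n)                         # outcome: independent of x
selected = (x + y + rng.normal(size=n)) > 1.0  # selection depends on both

r_all = np.corrcoef(x, y)[0, 1]
r_sel = np.corrcoef(x[selected], y[selected])[0, 1]
print(f"correlation, everyone:      {r_all:+.3f}")   # ~ 0
print(f"correlation, selected only: {r_sel:+.3f}")   # clearly negative

# Restricting the analysis to the selected rows conditions on the collider
# S and creates an x-y association that does not exist in the full data --
# the kind of problem a diagram with an explicit S node makes visible.
```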
So that’s causal modeling — useful but limited, like any tool. What about sensitivity and bias analysis? The same terms and conditions apply. They are still tools with human operators. They still simplify complex realities with often dubious assumptions. And so we still need to critically read their results with attention to possible bias and error, especially in terms of hubris and uncertainty aversion.
Cutting-edge tools don’t substitute for this essence of the critical-reflective lens; there is no substitute. They don’t externally verify the mapping of the model onto reality, either; usually, nothing does. This verification problem really limits what we can establish scientifically in ways a lot of people just don’t want to admit. To wit:
An ongoing concern is that excessive focus on formal modeling and statistics can lead to neglect of practical issues and to overconfidence in formal results (Box 1990; Greenland 2012a; Gelman and Loken 2014b). Analysis interpretation depends on contextual judgment about how reality is to be mapped onto the model, and how the formal analysis results are to be mapped back into reality (Tukey 1962; Box 1980)… assumption uncertainty reduces the status of deductions and statistical computations to exercises in hypothetical reasoning… this exercise is deceptive to the extent it ignores or misrepresents available information, and makes hidden assumptions that are unsupported by data. An example of the latter is the assumption of reporting honesty… (p. 13)
As critiques of the STAR-D trial show, statistical mistakes and possible scientific misconduct are a problem. The hope was that the right tools might ameliorate these sorts of problems. They might help well-intentioned people do better science. But we’ve still got to beware the same basic problems of complexity and uncertainty.
Unfortunately, sensitivity analyses are themselves sensitive to the very strong assumptions that the models and parameter ranges used are sufficient to capture all plausible possibilities and important uncertainties… Thus, while sensitivity and bias analyses are a step forward from traditional modeling, like all methodologies they are no panacea and should be approached with many cautions (p. 14).
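To make that concrete, here is a minimal sketch of a simple bias analysis for one unmeasured binary confounder, using the standard external-adjustment bias factor. The observed association and every scenario value are hypothetical, and the scenario grid itself is exactly the kind of assumption Greenland is warning about: confounders or biases you never parameterized stay invisible.

```python
# Toy bias analysis: how would an observed risk ratio change under different
# assumptions about one unmeasured binary confounder? All values hypothetical.

observed_rr = 1.5  # hypothetical crude exposure-outcome risk ratio

def adjusted_rr(rr_conf_outcome, prev_in_exposed, prev_in_unexposed):
    """Divide out the classic bias factor for one unmeasured confounder."""
    bias = (prev_in_exposed * (rr_conf_outcome - 1) + 1) / \
           (prev_in_unexposed * (rr_conf_outcome - 1) + 1)
    return observed_rr / bias

# A small grid of scenarios -- the analysis is only as good as this grid.
for rr_cu in (1.5, 2.0, 3.0):                # confounder-outcome risk ratio
    for p1, p0 in ((0.5, 0.2), (0.7, 0.2)):  # confounder prevalence by exposure
        print(f"confounder RR={rr_cu}, prevalence {p1} vs {p0}: "
              f"adjusted RR ~ {adjusted_rr(rr_cu, p1, p0):.2f}")

# Some scenarios leave the association largely intact; others push it to or
# past the null. Nothing in the output says which scenario is true.
```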
More generally, models are just tools that some practitioners misuse as magic.
Whether informal guidelines or formal modeling technologies, all inferential methods are practical aids with strengths and limitations, not oracles of truth (p. 14)… Controversies over the role of formal, algorithmic methods are nothing new, surprising, or unhealthy. They seem to be aggravated however by treating methodologies as overarching philosophies rather than as technologies or tools (p. 15).
But hey, this all sounds very anarchic — like it introduces critical possibilities into logical discourses, potentially throwing sand in all sorts of gears. Chaos! Exactly…
The Punking of Philosophy of Science
The title of Greenland’s “For and Against Methodologies” pays homage to punk philosopher of science Paul Feyerabend’s Against Method (seen here in concert).
Image credits: Gbfoverlake2023’s “Feyerabend, Kuhn, Hoyningen-Huene and colleagues after seminar at ETH Zurich” plus torso and guitar from Dave Cackowski’s “Craig Onstage in the '80s V220 Red Stripes (Dave Cackowski)” with glue stick, non-toxic fingerpaint, and gel pen.
The book is in the mail (no really), but Greenland relays Feyerabend’s message — conveyed first in his class in the early 1970s, and then in book form — thusly:
Every methodology has its limits, so do not fall into the trap of believing that a given methodology will be necessary or appropriate for every application; conversely, do not reject out of hand any methodology because it is flawed or limited, for a methodology may perform adequately for some purposes despite its flaws. But such constrained anarchism (or liberal pluralism, if you prefer) raises the problem of how to choose from among our ever-expanding methodologic toolkit, how to synthesize the methods and viewpoints we do choose, and how to get beyond automated methods and authoritative judgments in our final synthesis (p. 3).
Confronting this problem requires examining history, and grappling with psychosocial as well as logical factors, Greenland suggests.
Tools: Problem or Solution?
There is a debate in my circles about whether language net helps or hurts communication. It relates to the one we could have here about whether formal methods help or hurt inference. They’re both about representation, and how it lets us make leaps — for better and for worse.
The answer is, of course, it depends. Both. Tools pose problems and offer solutions. Language, and its reduction to the crema of a particular sort of formal model, can both revolutionize our ability to think things through, structurally and socially, and create new vulnerabilities for bias and error (both logical and psychosocial). There is no free lunch from fallibility and perspective. YMMV. Try it on the road. Triangulate methods and see what you make of the disconnects. The usual caveats apply — and they are recognizable from my superb but standard introductory training in graduate research methods.
It must be said, to end at the beginning, that guns, too, are tools often confused with philosophies. I leave this connection as an exercise to the reader.