Everybody's Got Bias Problems Except Me and My Science
What research questions should we prioritize? How did my dissertation get it so wrong? And could sharing the data now still do some good?
“Everybody’s got something to hide except me and my monkey.”— Lennon-McCartney.
Bias researchers often present their work as an objective pursuit of apolitical Truth. But what if the biggest bias problem in bias research is in science itself? What if the current politicization of National Science Foundation bias research funding — which the Trump Administration seems to be in the process of axing as part of a much broader sociopolitical and economic agenda — is just a pendulum swing away from a particular set of norms that had been setting an empirically misguided agenda?
At least it must seem to some observers, including professionals targeted by bias research, that bias researchers following sociopolitical norms painting science as objective — pretending to be perfectly neutral automatons instead of human beings engaging in a human enterprise — are acting as if “everybody’s got bias problems except me and my science.” Assuming they’re onto something, what’s the alternative?
Admitting scientists are human beings — social and political animals with unavoidably limiting perspectives who make mistakes — seems like a good start. Our perspectives can condition every step of the scientific process, from what questions we think are worth asking, to how we structure studies to answer them, code data for analysis, structure analyses, interpret results, and report the whole process. Indeed, trying to do so in ways that conform to prevailing social and professional norms is often considered a central part of the job — although here there seem to be perennial tensions between conformity and quality.
There are perverse incentives in particular for younger scientists to just do things the way they seem to be done, even when norms lag behind best practices. Of course, we can also just not know what’s right when there are discrepancies, even when we very much want to get it right. These discrepancies and uncertainties can multiply and magnify the already substantial uncertainties in scientific evidence. This was my dissertation experience.
As a grad student, I assumed polygraphs were junk science, that bias seeped into tech-mediated decisions including polygraph chart interpretation, that psychophysiological effects could drive ascriptive biases in polygraphy, and that polygraph programs might also select on authoritarianism (ambiguity creeps authoritarians out, some polygraph questions are built to be ambiguous, and bigger physiological responses to those questions can help people pass). The questions I prioritized were informed by my priors as well as my values, my advisers’ values, and our sociopolitical context. And I accidentally misinterpreted the evidence I gathered to test hypotheses stemming from these priors. In this, I had good company.
For one thing, I think I uncovered evidence that police polygraph programs might save lives by reducing police violence. But I didn’t make much of it at the time, in part because that so contradicted my priors — and in part because the analytical method that produced that result contained so much uncertainty as to be of questionable utility by itself. But the result actually makes sense if the intelligence community was right and the National Academy of Sciences was wrong about polygraph programs.
(NAS concluded the programs would do net harm to security at the National Labs, based primarily on estimated hypothetical outcomes from an application of Bayes' rule (see Table S-1A and B), which implies that programs of this mathematical structure (mass screenings for low-prevalence problems) tend to backfire under conditions of rarity, persistent uncertainty, and secondary screening harms. But this application assumed the programs had only one causal mechanism. They have three: detection, avoidance (deterrence), and elicitation (the bogus pipeline). We don’t know how the effects of these mechanisms interact and net out in the field. And we may not know how to scientifically validate the first mechanism, as NAS polygraph report committee co-chair Stephen Fienberg rightly insisted. But we can validate the latter two.)
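To make that detection-only logic concrete, here is a minimal sketch of the base-rate arithmetic such a Bayes'-rule analysis rests on. The prevalence, sensitivity, and specificity figures are hypothetical values chosen for illustration, not NAS's actual estimates.

```python
# Minimal sketch of the base-rate arithmetic behind mass-screening critiques.
# All numbers are hypothetical illustrations, not NAS's actual estimates.

prevalence = 0.001    # assume 1 in 1,000 screened employees is a genuine security risk
sensitivity = 0.90    # assumed probability the test flags a true risk
specificity = 0.90    # assumed probability the test clears a non-risk

population = 10_000
true_risks = population * prevalence                              # 10 people
true_positives = true_risks * sensitivity                         # 9 flagged correctly
false_positives = (population - true_risks) * (1 - specificity)   # 999 flagged wrongly

ppv = true_positives / (true_positives + false_positives)
print(f"Flagged: {true_positives + false_positives:.0f}, "
      f"of whom only {ppv:.1%} are true risks")

# With a rare problem, even a fairly accurate test produces mostly false alarms.
# That is the detection-only logic a Table S-1-style analysis rests on; it says
# nothing about the deterrence and elicitation mechanisms.
```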
For another thing, I collected evidence that some tech-mediated decisions could indeed be vulnerable to some forms of bias. But again I didn’t make much of it, this time because I made a few common methodological mistakes…
A few of my favorite mistakes
One of my favorite things to do for fun on the weekends is critique the rolling tide of bad medical literature for making stupid mistakes like misinterpreting statistical significance test results (reflecting the cognitive bias of dichotomania), not assessing effect sizes in terms of possible practical importance, and not thinking through causality. These are a few of my favorite mistakes. At least, they are now that I know how to recognize and correct them when other people make them.
When I defended my NSF-sponsored bias dissertation, over a decade ago, I didn’t know any better and made all these mistakes. Here’s an example correction…
The results of polygraph chart interpretation Experiment 1 were misinterpreted as null racial bias findings; they actually show possible racial bias against dark-skinned versus light-skinned black and Hispanic subjects, as hypothesized.
However, it would be difficult, if not impossible, to assess this finding’s field generalizability in terms of disparate impacts, not least because reported racial subgroup data is typically not disaggregable in terms of skin color. This concern about the (im)possibility of addressing these experiments’ artificiality joins other relevant problems with much racial bias research more broadly — such as unanswerable causal questions, incompatible statistical fairness criteria, and (ir)reproducibility — to cast doubt on the wisdom of my overall research design, starting with the questions. Alternatively, one could argue that it was valid, well-designed research with practically important implications that just didn’t produce certain knowledge one way or the other.
Similarly, effect sizes were interpreted in terms of statistical significance but not practical importance. For instance, the (statistically significant) confirmation bias effect in polygraph chart Experiment 1 meant that, for a chart associated with a neutral background investigation summary, there were 340 “no deception indicated” (NDI) calls versus 284 “deception indicated” (DI) calls; for the same chart associated with a negative background, there were 252 NDIs versus 332 DIs.
The magnitude of this difference would seem to be of practical importance. But how should that be translated into real-world terms? For one thing, again, generalizability cannot be established. What practical import does a big effect have if it’s just in an MTurk experiment? Arguably, it has possible practical importance. But it has no established practical importance whatsoever. One could argue this was valid research that just couldn’t produce certain knowledge by design.
The same limitation applies to the above-mentioned racial subgroup results in the same experiment: For the same chart associated with a different photo on the background investigation summary, there were 111 NDIs versus 146 DIs for a white subject, 122 versus 104 for a light-skinned Hispanic subject, 111 versus 115 for a dark-skinned Hispanic subject, 144 versus 120 for a light-skinned black subject, and 104 versus 131 for a dark-skinned black subject. These possible racial subgroup differences might be of practical importance if they generalized to the field. But there seems to be no way to know if they do, or not. Much less to know what they would mean in relevant real-world units, if they did.
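To make “magnitude” concrete, here is a minimal sketch of one way to turn the counts reported above into effect sizes. The counts come from the text; the choice of metrics (deception-indicated rates, a rate difference, an odds ratio) is mine, for illustration only, and says nothing about field generalizability.

```python
# Sketch of one way to express the reported counts as effect sizes.
# Counts are from the experiments described above; the metrics are my choice.

def di_rate(ndi, di):
    """Share of calls that were 'deception indicated'."""
    return di / (ndi + di)

# Confirmation bias: same chart, neutral vs. negative background summary
neutral, negative = (340, 284), (252, 332)   # (NDI, DI) counts
rate_neutral, rate_negative = di_rate(*neutral), di_rate(*negative)
odds_ratio = (negative[1] / negative[0]) / (neutral[1] / neutral[0])
print(f"DI rate, neutral background:  {rate_neutral:.1%}")
print(f"DI rate, negative background: {rate_negative:.1%}")
print(f"Difference: {(rate_negative - rate_neutral) * 100:.1f} percentage points; "
      f"odds ratio ~ {odds_ratio:.2f}")

# Racial subgroups: same chart, different photo on the background summary
subgroups = {
    "white": (111, 146),
    "light-skinned Hispanic": (122, 104),
    "dark-skinned Hispanic": (111, 115),
    "light-skinned black": (144, 120),
    "dark-skinned black": (104, 131),
}
for label, counts in subgroups.items():
    print(f"DI rate, {label}: {di_rate(*counts):.1%}")
```

The double-digit swings in DI rates are why I call these effects possibly practically important; whether they mean anything in the field is exactly the generalizability question raised above.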
As for thinking through causality, it seemed reasonable at the time to conduct 4-6 interpreter bias survey experiments to estimate possible bias on one side of the equation (extending here to a few additional case studies of tech-mediated decision-making), conduct a few psychophysiology lab studies to estimate possible polygraph bias on the other side, run difference-in-differences analyses on national police data, and FOIA/sue a bunch of federal agencies to try to get more such real-world data released (a mostly failed effort), all in order to get at the field interaction between possible interpreter and subject biases.
But, in retrospect, this effort was inadequate because it didn’t even attempt to draw out the causal logics of the polygraph itself — an exercise that would have revealed, among other things, that the relevant National Academy of Sciences report had reached an erroneous conclusion. It also would have shown that subgroup differences could work along each of the polygraph’s three separate causal pathways, individually and in combination. All this would require further thought if we wanted to properly assess the possibility of bias in polygraphy. This has not been done.
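For readers unfamiliar with the difference-in-differences design mentioned above, here is a minimal sketch of the comparison it makes. The groups, outcome, and numbers are hypothetical stand-ins, not my actual LEMAS analysis.

```python
# Minimal difference-in-differences sketch with hypothetical department-level data.
# The estimand: the change in an outcome (say, violent incidents per 1,000 officers)
# among departments that adopted polygraph screening, relative to the change among
# departments that did not.

# (group, period) -> mean outcome; all numbers are invented for illustration
means = {
    ("adopters", "before"): 12.0,
    ("adopters", "after"): 9.5,
    ("non_adopters", "before"): 11.0,
    ("non_adopters", "after"): 10.5,
}

change_adopters = means[("adopters", "after")] - means[("adopters", "before")]
change_non_adopters = means[("non_adopters", "after")] - means[("non_adopters", "before")]
did_estimate = change_adopters - change_non_adopters   # -2.5 - (-0.5) = -2.0

print(f"Difference-in-differences estimate: {did_estimate:+.1f}")

# The estimate is causal only if both groups would have trended in parallel absent
# adoption: a strong, untestable assumption, and one reason this method carried
# "so much uncertainty as to be of questionable utility by itself."
```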
Possible racial bias in polygraphy
Racial bias in polygraphy is possible but not established, even though my dissertation research involved many efforts to estimate it. This is why I debunked Whittaker et al’s recent claims of racial bias in polygraphy, which they did not like.
I’m no longer interested in trying to settle the question, believing it to be unanswerable and the wrong question anyway. If I ever “finish my dissertation” (reinterpreting everything correctly and publishing articles/a book out of every bit of research, as I suppose I properly should have done), I wouldn’t approach its questions, such as this one, as questions my research could definitively answer. At the time, I thought that was my job, and I felt terrible that I couldn’t do it, when not being able to do it was really just a matter of rightly seeing the limitations of science in relation to the evidence at the time.
But the record should be correct on this point: We do not and cannot know based on the best available evidence whether biases such as racial, gender, or confirmation bias (or some interaction thereof) affect polygraphy systematically, or not. Probably the best we could do is test for disparate impacts — a contested legal definition of bias — in a particular institutional setting.
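For concreteness, such a disparate impact test could start with the four-fifths-rule calculation sketched below. The pass rates are entirely hypothetical; the point is only to show what kind of analysis the relevant data would make possible.

```python
# Sketch of a four-fifths-rule disparate impact check, the kind of first-cut test
# that released program data would make possible. Pass rates are hypothetical.

pass_rates = {
    "subgroup_a": 0.72,   # hypothetical polygraph pass rate for one subgroup
    "subgroup_b": 0.55,   # hypothetical pass rate for another subgroup
}

highest = max(pass_rates.values())
for group, rate in pass_rates.items():
    ratio = rate / highest
    flag = "possible adverse impact" if ratio < 0.80 else "ok"
    print(f"{group}: selection ratio {ratio:.2f} ({flag})")

# A ratio under 0.80 is the conventional EEOC rule of thumb for adverse impact.
# It is a legal heuristic applied to one institution's data, not a scientific
# test of bias in polygraphy as such.
```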
Polygraph program nontransparency has historically blocked efforts to assess relevant data for such impacts. That doesn’t mean we can assume the programs institutionalize bias. It means we don’t know and we’re not able to find out. This nontransparency is political.
“Political” as a dirty word
Part of what scientists are reacting against when we try to paint ourselves and our work as objective, neutral, and certain — all of which tend to be material misrepresentations — is the current cultural conception of “political” as a dirty word.
When I conducted a by-mail survey of Virginia state-licensed polygraphers in 2010-2011 as part of my dissertation research, one subject wrote in: “young lady, I don’t need to take part in your survey. I thought the questions were political, and I’m not going to fill it out, nor will I send it back.” The survey had indeed included questions about partisanship, including two American National Election Studies questions about racial attitudes with which many respondents expressed displeasure.
Another polygrapher informed me that his colleagues in the polygraph industry (the American Polygraph Association) and the defense establishment thought I was a spy, shortly before I experienced an overwhelming array of multi-platform attacks. Around this time, the data from the polygrapher survey appear to have gone missing from my hard drive, and my back-up drives were wiped clean. (I had other backups.)
Thus, the most interesting finding of my dissertation research turned out to be one I could not write up for an academic audience. It’s easy to say that most scientific and popular writing serves the preferred narratives of powerful sociopolitical networks. It’s hard to prove just how costly challenging them, instead, can be. That power asymmetry is of course political.
In American and scientific cultures, “political” is still widely considered a dirty word. Scientists have been accordingly slow to acknowledge our human problem, preferring instead to narrate the shocking reality of science turning out to be full of human error and perspective (being made by humans) as a crisis to be solved, instead of a fact to be understood.
But the scientific discourse has improved a lot since then in terms of acknowledging widespread problems of uncertainty laundering and other common cognitive biases in science. It has also made good headway building out the open science infrastructure that did not yet exist when I defended my dissertation. That infrastructure is one reason (having moved on is another) why posting my dissertation data and materials online, which seemed technically impossible at the time, now seems relatively easy.
Data and materials
are here now to the best of my abilities.
This research was conducted over a decade ago, and I don’t have access to my old university email or some of the software I used. (Yes, I used Stata sometimes as a student; but I didn’t inhale.) I’ve done my best to upload the right versions of everything important. If you have questions or want to see something that’s not there, just ask. If you notice that I have made a mistake, please tell me.
This is a mess, but better than not sharing it at all.
Why the delay in sharing the data and materials?
I wanted to publish everything all together (the minimum standard), and I had difficulty figuring out, technically, how to share everything right away (the higher standard). But I never had time to publish it at all, unless you count media collaborations or the dissertation PDF on my website.
The main reason is that I was dying to get out of Dodge (and out of my grad school depression), got a job, and Skyped my dissertation defense on the cross-country drive to start my NSF postdoc at UCLA Psychology. There, I was tasked with serving someone else’s publication agenda (a bad idea for all concerned), then moved cross-country again to Harvard after six months, and then flew around the country pitching a national policing database and washing my underwear in hotel sinks while Ferguson burned, before leaving academia and the U.S.
A lot of things have improved since then, including open science infrastructure. But I still occasionally have a nightmare that I need to finish my dissertation.
Why is this not a scientific article?
It is around 70% done.
Methods
I used every method I could get my grimy graduate hands on, including:
videotaped interviews,
a by-mail survey,
a grand total of seven online survey experiments,
two lab-based psychophysiology studies involving mock polygraphs based on federal polygraph protocols,
the results of years of requesting federal polygraph program data and documents from multiple agencies,
an extensive set of difference-in-differences analyses on nationwide police (LEMAS) data, and
open-source materials.
Of all these methods, I now find the synthesis of the interviews and open-source materials most compelling. This is also the most obviously interpretive exercise. I think that reflects bias in social science in favor of quantitative methods that appear to be more objective and less political, but that are often neither.
A note about the Amazon Mechanical Turk survey experiments: At the time, I was the first person in my department to do MTurk studies. I was proud and happy that I figured it out myself, and relieved that it solved my data problem — so I could mostly stare at my computer and avoid dealing with people. The main problem was that I couldn’t get data to assess bias in polygraphy, my main interest. But this was the proverbial looking for the keys under the lamppost “because that’s where the light is.” I don’t think much of the generalizability argument for any of this stuff.
Analysis
I was never confident in my statistical analyses, and now understand that doubt as a correct assessment on three levels. First, I was repeatedly advised to express more certainty in interpreting my findings than the evidence itself provided (nullism). Second, I was forced by a collection of controversial departmental hiring and tenure decisions to form an ad hoc committee of nice, smart people who could not evaluate my quantitative work. Of the four external stats readers I then recruited, three were wrong (one of whom went on to stalk me). Third, I trusted the incorrect majority in misinterpreting some of my results. That was my mistake, though I do think I did more than due diligence in trying to prevent exactly this sort of misinterpretation from characterizing my work.
Stalking aside, another interesting para-finding was that all of my outside stats readers had different preferred analytical approaches and interpretive views. At the time, this made me worry all the more that I, myself, was incompetent. But now I see this (also) as a symptom of larger problems in the field…
In the garden of forking paths, different researchers can make different, more or less defensible analytical choices with the same dataset. There are so many researcher degrees of freedom in most problems that, if you’re anxious to get it right, you’re likely to be disappointed if you try to get someone else to meaningfully check your work. Because they will get a different answer.
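One way to see the forking paths at work is a small multiverse-style exercise: run several defensible-looking analysis variants on the same dataset and compare the answers. The simulated data and analytic choices below are my own invention, not anything from the dissertation; the tutorial discussed next does this kind of diagnosis far more rigorously.

```python
# Toy "multiverse" sketch: one simulated dataset, several defensible-looking
# analysis paths, several different answers. Everything here is invented.

import numpy as np

rng = np.random.default_rng(0)
n = 400
treated = rng.integers(0, 2, n)              # random assignment to condition
age = rng.normal(40, 12, n)                  # a covariate
outcome = 0.3 * treated + 0.02 * age + rng.normal(0, 1, n)   # true effect = 0.30

def estimate(mask, adjust_for_age):
    """Difference in means on a subset, optionally after regressing out age."""
    y, t, a = outcome[mask], treated[mask], age[mask]
    if adjust_for_age:
        y = y - np.polyval(np.polyfit(a, y, 1), a)   # crude covariate adjustment
    return y[t == 1].mean() - y[t == 0].mean()

analysis_paths = {
    "all data, unadjusted": estimate(np.full(n, True), False),
    "all data, age-adjusted": estimate(np.full(n, True), True),
    "outcome outliers (>2 sd) dropped": estimate(
        np.abs(outcome - outcome.mean()) < 2 * outcome.std(), False),
    "under-60s only, age-adjusted": estimate(age < 60, True),
}
for path, est in analysis_paths.items():
    print(f"{path}: {est:+.3f}")

# Each path is arguably defensible, and each returns a somewhat different estimate
# of the same underlying effect. That is the checking-your-work problem above.
```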
This is a more widely-understood problem with better possible solutions now than when I was dissertating. See, for instance, Dorothy Bishop and Charles Hulme’s “When Alternative Analyses of the Same Data Come to Different Conclusions: A Tutorial Using DeclareDesign With a Worked Real-World Example” (Advances in Methods and Practices in Psychological Science, Sept. 2024):
Recent studies in psychology have documented how analytic flexibility can result in different results from the same data set. Here, we demonstrate a package in the R programming language, DeclareDesign, that uses simulated data to diagnose the ways in which different analytic designs can give different outcomes… DeclareDesign… can simulate observational and experimental research designs, allowing researchers to make principled decisions about which analysis to prefer.
It’s great that this and other tools like it are part of the discourse. But, at the same time, simulating different analyses like this doesn’t actually solve the basic interpretive problems of statistics. The reason is that these accrue from the basic interpretive problems of science. And those are often irreducible. Scientists have perverse incentives to deny this, including because our paychecks may depend on producing answers to questions that are unanswerable.
What needs to be done next?
(Re-)Analysis and interpretation. Since the contemporaneous writing, I have learned so much more about statistics and science than I knew in grad school that I wouldn’t publish any work from that era as-is. I know this runs counter to all sorts of incentives and norms, but I’m really glad I didn’t publish, as I needed more distance and education. It didn’t feel right. Sometimes, the file drawer problem is that people want to get it right.
There would be so many more things that would bother me about my old work, if I had any time to revisit it. A new one popped into my head recently after reading Kleinberg et al. 2018 (“Human Decisions and Machine Predictions,” Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan, NBER Working Paper 23180; The Quarterly Journal of Economics, Vol. 133, No. 1, pp. 237-293). They found people tended to underweight previous criminal record (p. 41), whereas I found evidence that confirmation bias ran in the expected direction in an MTurk polygraph chart interpretation survey experiment. It may be worth studying confirmation bias in tech-mediated decisions more, as I’m sure a lot of smart people are doing.
Why I won’t be doing it at the moment
Other commitments take precedence — a newborn and a four-year-old. In addition, I have come to see polygraph programs as one example of the larger class of mass screenings for low-prevalence problems. Within this area of interest, the Chat Control case study takes priority, as it has tremendous sociopolitical significance in the context of security forces’ ongoing struggle to implement cheap mass surveillance amid modern states’ perpetual efforts to protect liberal democracy by structuring effective techno-legal constraints on such infrastructure.
Conclusion
This is incomplete research executed by an imperfect human in an imperfect world. I no longer think it addresses priority problems in wise ways. But, at the time, it was part of my best effort to do just that.
At least the reader might celebrate the infrastructural improvements in institutional science that made sharing the data possible. At most, people with irreconcilable differences might agree that integrity isn’t about getting everything right the first time. It’s about correcting mistakes and being transparent — especially about the uncertainties that pervade much of science and statistics, just as they pervade much of life. Such openness is how we move forward.
Acknowledgments
This research was funded by a National Science Foundation Dissertation Improvement Grant, Louise and Alfred Fernbach Award for Research in International Relations, Albert Gallatin Graduate Research Fellowship, U.Va. President’s Fellowship, and Bernard Marcus Humane Studies Fellowship. Its NSF funding might have been flagged for possible termination under recent Presidential orders if it were ongoing today.