The corn dies, along with two young babies. The villagers suspect you of witchcraft. You profess your innocence. They instruct you to chew some rice powder to clear your name. It comes back dry: Guilty!
But never fear! Scientists are here to defend you! With counter-accusations that “lie detection” is… Racially biased?
I conducted NSF-sponsored Ph.D. dissertation research on racial and confirmation bias in polygraphy. And I’m not so sure. But I suspect this is the wrong answer. And the wrong question, too.
Whittaker et al’s recent review claims racial bias in polygraphy is an established fact (“Racial biases in polygraphs and their legal implications,” Nat Hum Behav 9, 3–4 (2025); open-access preprint). Their claims lack sufficient evidentiary basis and reflect sociopolitical bias in the scientific and broader discourse. Here’s a brief rebuttal. My PhD research on this topic did not establish racial bias or its absence in polygraphy. A future post will say more about it and share the data.
The authors claim skin conductance response (SCR) has technical features that worsen criminal justice inequities. The available evidence does not establish this.
Acknowledging uncertainty would correct the mistake: underlying physiological differences may or may not produce lower average SCR among Black people; that may or may not contribute to possible racial bias in polygraphy; and any such bias may or may not contribute to well-established racial disparities in criminal justice.
Whittaker et al then claim that Black people’s polygraph results are disproportionately likely to be interpreted as inconclusive. Again, this effect is possible but unproven. I know something about all this from my diss…
What my research found (and what it didn’t)
As a grad student, I spent years requesting polygraph data to analyze for bias from numerous federal agencies under the Freedom of Information Act and eventually a Knight Foundation-sponsored lawsuit. Secrecy blocked science.
There may be something to the racial bias concern. For instance, I got some relevant equal opportunity complaints released — and featured in McClatchy Newspapers in 2012 and Wired in 2018.
But individual complaints of discrimination don’t prove systematic inequality. There could be misperceptions of discrimination. There could also be individual cases of true discrimination in the absence of a systemic problem — a “bad apples” problem.
If racial bias systematically affected polygraphy in a way that mattered for real-world consequences, then we would expect to see evidence of disparate impact. And we would expect it to comport with some systemic explanation(s), like prejudice and/or psychophysiological differences.
My research didn’t find definitive proof of such impacts in difference-in-differences analyses of nationwide police departmental (LEMAS) data. It didn’t find definitive proof of such prejudice among Virginia state-licensed polygraphers in a by-mail survey. Nor in a series of Amazon Mechanical Turk survey experiments on possible interpreter bias in technology-mediated decisions, including polygraph chart interpretation. And it didn’t find definitive proof of such bias in the form of a stereotype threat effect at the physiological level against liberal, politically active Black college students taking mock polygraphs, which was designed as a tough test against the null of no bias. (To the contrary, they had an apparent advantage!)
That may sound like proof that racial bias doesn’t affect polygraphy. It’s not. “Absence of evidence is not evidence of absence,” in Altman’s terms. Yet, I had been professionally trained to apply that fallacy by doing null hypothesis significance testing, among other accidental methodological misdeeds. So I made the mistake of looking at whether my findings disproved the null hypothesis of no bias; they did not.
But why would one suspect possible racial bias in polygraphy in the first place? Whittaker et al claim there are racial subgroup differences in SCR. They cite Bradford et al 2022, who write:
individuals who identify as African American or Black can appear to have lower skin-conductance levels and smaller conditioned responses than non-African American/Black individuals (Davis & Cowles, 1989; Janes et al., 1976; Johnson & Landon, 1965).
Assuming these differences exist, we don’t know whether they matter in polygraphy. Bracketing that for a moment, why might Blacks appear to have lower SCR?
Sweat, melanin, and evolution
It’s possible that this is a physiological phenomenon: lower average subgroup sweat drives lower SCR. But why?
One possibility is evolutionary adaptation in sweating and heat regulation, a hypothesis I haven’t seen anyone else lay out (though presumably someone has).
SCR and measures like it basically measure fingertip sweat. Sweating less may have been adaptive for people who lived in hotter, drier places, lest they become dehydrated and die. But wait, don’t we need sweat to cool?
Yes. But melanin (the pigment that makes skin darker) likely compensates in part for this reduced sweating: its conductivity helps the body shed more heat with less sweat. The body produces more melanin in response to UV exposure, and possibly to heat, increasing the adaptive effect as needed.
If more melanin and less sweat tend to track together for evolutionary reasons, this would create racial subgroup differences in physiological responses relevant to polygraphs. (Strictly speaking, the frame of reference in this description is reversed: darker skin is thought to be the ancestral state, with lighter-skinned populations descending from the hominins who migrated out of Africa perhaps 60,000 years ago and left modern descendants in Europe, the Levant, and Asia.)
As Bradford et al note:
Early research identified phenotypic factors that could affect EDA measurement fidelity, including number of active sweat glands (Boucsein, 1992; Kawahata & Adams, 1961; cf. Thomson, 1954; Wesley & Maibach, 2003), thickness of the upper layer of the skin (Berardesca & Maibach, 2003; Johnson & Corah, 1963; Weigand et al., 1974), electrolyte content of sweat (Johnson & Landon, 1965), skin resistance (Johnson & Corah, 1963; Juniper & Dykman, 1967), and skin temperature (Thomson, 1954).
But no one wanted to talk about how darker skin itself could at least partly explain the apparent racial subgroup difference. How cool melanin is — literally.
This may seem counter-intuitive, because we tend to think darker colors will make you hotter by absorbing more sunlight. But it’s more complicated than that. Maybe people in a lot of hot places tend to wear dark clothing for a reason. Maybe people adapted to those environments tend to have dark hair for a reason, too.
The role of SCR in polygraphy
So racial differences in SCR would make causal mechanistic sense — though it would be nice to see a table with ranges of possible effect sizes along with data and sample characteristics. But, assuming they exist, these differences don’t have established field implications in polygraphy — where SCR is just one channel among three.
It’s possible that lower average Black SCR dampens the reaction differences used to interpret polygraph tests. This could matter practically. The polygraph community generally agrees that electrodermal activity (EDA, the umbrella measure that includes SCR, historically called galvanic skin response, GSR) is the easiest-to-measure channel, the most important to scoring accuracy, and by far the most frequently and heavily relied on in human and computer polygraph scoring (see, e.g., Blalock, Cushman, and Nelson 2009; Capps and Ansley 1992; Handler et al. 2010; Kircher and Raskin 1988; Kircher et al. 2005; Krapohl and McManus n.d.; Orne, Thackray, and Paskewitz 1972).
In contrast with this overwhelming consensus that EDA most shapes polygraph outcomes among physiological measures, a federal polygraph examiner handbook states “The pneumograph, electrodermal, and cardiovascular tracings are analyzed separately and given equal emphasis in the decision process” (Department of Defense, Counterintelligence Field Activity 2006). Polygraph studies tend to indicate respiratory responses either do not matter at all, or matter the least among measures and then only sometimes, in supposedly distinguishing guilty and innocent subjects (Barland and Raskin 1975; Dawson 2000; Podlesny and Raskin 1978). The vast majority of federal polygraphers ignore or misinterpret respiratory data in their scoring decisions (Kircher et al. 2005). But does this really matter?
Polygraph scoring is a mess. Continuous responses are coarsened through varying systems of assigning positive and negative integers to what are usually quite small differences in responses. Polygraphers typically assign larger integers (in absolute value terms) to differences in responses they perceive to be larger, according to subjective rules of thumb such as the “Something-Versus-Nothing Rule” (Maschke and Scalabrini 2005). These integers are then summed, and the total is mapped onto a categorical variable with three categories: some variant of pass, fail, and inconclusive. Inconclusive results are actively minimized, and their data routinely omitted from analyses.
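To make the mechanics concrete, here’s a minimal sketch of that scoring logic in Python. The integer values and the ±6 cutoffs are hypothetical illustrations, not any agency’s actual scoring rules.

```python
# Minimal sketch of hand-scoring logic as described above.
# The integer assignments and the +/-6 cutoffs are hypothetical
# illustrations, not any agency's actual scoring rules.

def score_chart(channel_scores: list[int],
                pass_cutoff: int = 6,
                fail_cutoff: int = -6) -> str:
    """Sum per-question, per-channel integer scores into one total,
    then coarsen that total into a three-category outcome."""
    total = sum(channel_scores)
    if total >= pass_cutoff:
        return "pass"          # "no deception indicated"
    if total <= fail_cutoff:
        return "fail"          # "deception indicated"
    return "inconclusive"      # routinely minimized in practice

# Small differences in continuous responses become small integers:
print(score_chart([2, 1, -1, 3, 1, 1]))   # pass (total 7)
print(score_chart([1, -1, 0, 1, -2, 0]))  # inconclusive (total -1)
```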
As a methodological matter, this is bad practice. But here it means that, even if Blacks do have more “inconclusives” in the lab (which isn’t established), that may not generalize to field conditions characterized by scoring practices intended to disappear inconclusives.
It’s possible, however, that digitization has changed all this in recent years, or will. EDA/SCR/GSR is visually comparatively easy to read: it can look like big spikes instead of little squiggles. So when polygraph scoring no longer relies on analogue visual assessment of channel response differences, we might expect it to no longer rely disproportionately on this channel. When it’s more reliant on algorithms, we might expect egregiously unscientific scoring practices to become more standardized.
(This is not to say that algorithmic polygraph chart interpretation is scientific. The whole test is still based on the false premise that imperfect proxies somehow turn into perfect signals for truth/deception when combined — when such signals are not known to exist. And/or on the complementary myth that “good enough” proxying produces desired aggregate effects — when simulation suggests that aggregate effects of mass screenings for low-prevalence problems like spying undermine the security they’re intended to advance.)
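To see the low-prevalence problem in numbers, here’s a back-of-the-envelope calculation. The prevalence and accuracy figures are hypothetical round numbers chosen for illustration, not estimates from any particular study.

```python
# Hypothetical mass-screening arithmetic (illustrative numbers only):
# screen 10,000 employees when, say, 10 are actually spies, with a
# test that catches 90% of spies at a 10% false positive rate.

employees, spies = 10_000, 10
sensitivity, false_positive_rate = 0.90, 0.10

true_positives = spies * sensitivity                          # 9 spies flagged
false_positives = (employees - spies) * false_positive_rate   # 999 innocents flagged

ppv = true_positives / (true_positives + false_positives)
print(f"Flagged {true_positives + false_positives:.0f} people; "
      f"only {ppv:.1%} of them are actual spies.")            # ~0.9%
```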
Bigger statistical problems
It’s possible that there are racial subgroup differences in SCR, and that they cause more inconclusive polygraph tests among Blacks. This would be systematic subgroup variation in ambiguity.
As a class, problems with this structure implicate recent work on statistical fairness criteria (e.g., Hedden’s “On Statistical Criteria of Algorithmic Fairness”). This literature finds that different fairness criteria are incompatible. Hedden presents an idealized example algorithm in which systematic subgroup variation in ambiguity causes violation of all criteria except Calibration Within Groups, resulting in more mistaken classifications for the more ambiguous subgroup (see previous). This mathematical incompatibility raises normative questions about value prioritization (e.g., whether we want more accurate or more equal algorithmic decisions).
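Here’s a toy numeric sketch of that structure (my own illustration with made-up numbers, not Hedden’s example): two groups share a base rate, and scores are perfectly calibrated within each group, yet the more ambiguous group ends up with triple the error rates.

```python
# Toy illustration of ambiguity-driven unfairness (my numbers, not
# Hedden's). "Calibrated" means a fraction `score` of each score
# bucket is truly positive.

def error_rates(buckets, threshold=0.5):
    """buckets: list of (score, n_people) pairs. Classify positive
    at or above the threshold; return (false positive rate,
    false negative rate) under within-group calibration."""
    fp = fn = negatives = positives = 0.0
    for score, n in buckets:
        pos, neg = score * n, (1 - score) * n
        positives += pos
        negatives += neg
        if score >= threshold:
            fp += neg   # true negatives wrongly flagged
        else:
            fn += pos   # true positives wrongly cleared
    return fp / negatives, fn / positives

# Low-ambiguity group: scores cluster near 0 and 1.
print(error_rates([(0.9, 100), (0.1, 100)]))  # (0.1, 0.1)
# High-ambiguity group: scores cluster near 0.5.
print(error_rates([(0.7, 100), (0.3, 100)]))  # (0.3, 0.3)
```

Both groups satisfy Calibration Within Groups and share a 50% base rate, yet no non-degenerate threshold can equalize their error rates without breaking calibration.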
Bigger proof problems
It’s not clear what scientific test could establish racial bias in polygraphy in the field. That’s because it’s not clear what scientific test could validate lie detection in the field, period. This was public statistician Stephen Fienberg’s take while he was co-chairing the National Academy of Sciences’ polygraph report committee. The best we could do, as a result, is probably to establish disparate impact, if it exists — itself contested as a form of discrimination.
Bigger science problems
Some of Whittaker et al’s cited research raises replication concerns that pervade science but trouble the bias literature in particular. Many relevant findings (e.g., on stereotype threat) have failed to replicate. The authors do not appear to recognize these concerns.
Why care?
Whittaker et al argue that “any tool used in the criminal legal system must be thoroughly evaluated for its potential to perpetuate and/or increase, versus lessen, racial disparities.” This claim reflects (some) current sociopolitical attitudes, but appears to be ignorant of recent scientific literature (e.g., Hedden’s “On Statistical Criteria of Algorithmic Fairness”) on how “a number of intuitively attractive statistical criteria of fairness are not jointly satisfiable except in marginal cases” (p. 10).
This literature implies a likely conflict between increasing decision accuracy in contexts where that supports norms of justice as fairness (e.g., bail decisions) and ensuring some forms of justice as equality (e.g., equal false positive and negative rates between racial subgroups when their base rates or ambiguity levels differ). Under the universal laws of mathematics, you can’t have it all. (See my brief explanation of how this incompatibility stems from Bayes’ rule.)
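For reference, the arithmetic is short. Writing p for a group’s base rate, Bayes’ rule gives the positive predictive value (my summary of the standard result, not a quote from the linked explanation):

```latex
% Positive predictive value by Bayes' rule, for base rate p,
% false negative rate FNR, and false positive rate FPR:
\mathrm{PPV} = \frac{(1-\mathrm{FNR})\,p}{(1-\mathrm{FNR})\,p + \mathrm{FPR}\,(1-p)}
```

If two groups share the same FNR and FPR but differ in base rate p, their PPVs must differ; equal error rates and equal predictive values cannot both hold except in marginal cases (perfect prediction or identical base rates).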
In the end, however, Whittaker et al make the right practical point for the wrong reason: “We join with others’ calls to exclude polygraph evidence from the legal system entirely, which have focused on poor overall reliability and validity…” There are plenty of renowned scientists and professional polygraphers alike who would agree.
But the authors then go on to “contend that these issues are likely to be amplified for Black individuals who are already systemically disadvantaged by the criminal legal system.” The best available evidence is insufficient to establish or refute this claim.
Finitude
It is strategically unwise and tactically unnecessary to peg the success of forensic and other reforms to unproven bias charges. Researchers need to think hard about the terms we want to set in trying to contribute meaningfully to society. Advancing the scientific evidentiary basis of forensics in particular and policy in general arguably makes more sense than chasing subgroup effects.
In mathematical terms, bias research divides finite attention and other resources. If polygraphs are (or are not) biased against Blacks, then what about Hispanics? What about women? What about other out-groups?
In the bigger picture, is justice as equality the form of justice we most want to advance in our short lives, with our limited attentional and other resources? Might we instead care more about some other metric — like accuracy or efficacy? Given that we can’t have it all, is focusing on bias what we really want? What “we”?
This line of questioning invokes Isaiah Berlin’s concept of value pluralism. First principles can conflict, requiring prioritization. This prioritization is political. Scientists can’t claim to solve difficult ethical dilemmas for people who disagree, or to work outside our sociopolitical contexts in some kind of neutral bubble. Such a bubble does not exist.
Bias research, like axing its funding, is political. And that’s ok, as long as researchers know it. But it’s got to be evidence-based, too.