Algorithmic Hype, Perverse Incentives, and Watching Retraction Watch
Bias in bias science isn't just about the inescapability of perspective, but also about the perceived value of work with the "right" kinds of bias
Limits. There’s no exit from being fallible creatures of our psychosocial milieus with limited cognitive-emotional abilities to get the meta on our own perspectives. Like me and my Dixie twang, we often can’t hear ourselves like other people can, since we can only toggle among other perspectives — not achieve perfect neutrality, linguistic or otherwise. This enmeshment of consciousness in context presents tail-eating problems in science. Problems of turtles on turtles, of problems with problems — of bias in bias science.
The first post in this bias-in-bias series highlighted bias in bias research (Mugg and Khalidi) on bias research (Gigerenzer) on bias research (in cognitive psychology and behavioral economics writ large). The second looked at bias in science and science communication on masks and respirators, focusing on Jefferson et al’s 2023 Cochrane review and Demasi and Gøtzsche’s critique of Oreskes’s critique thereof — especially on how the idea and rhetoric of objectivity do the work of power in this discourse, and that work is gendered. This post examines research integrity problems in research integrity research (McIntosh and Vitale 2023) on abortion research (Coleman 2022) that dissents from the preferred establishment narrative (a narrative previously summarized and critiqued in my abortion myths series here: parts 1, 2, 3, and 4).
The current scientific consensus on abortion is that it’s not psychologically risky for women; in contrast, the evidence suggests it’s associated with substantial increases in mental health risks including suicide. Coleman’s dissenting article critiqued the standard consensus narrative citation of the Turnaway study along these lines. I also critiqued Turnaway here; while abortion proponents typically misinterpret it as proving that abortion poses no mental health risk, it actually showed a substantial possible increase in depression risk associated with abortion.
Coleman’s article was targeted for retraction “due to ‘undisclosed competing interests... which undermined the objective editorial assessment of the article during the peer review process’ (Coleman 2022). The retraction took months and involved legal proceedings (Marcus 2022)” (McIntosh & Vitale, p. 13). The process, not the substance, was the issue, and the impetus for the journal’s post-publication review of that process was social media posts by pro-choice scholars about relevant pro-life affiliations (i.e., Manuel “#AbortionIsHealthcare” Galvan and Chelsea B. Polis, affiliated with the pro-choice Guttmacher Institute since 2014). The Retraction Watch blog that McIntosh & Vitale cite (Marcus 2022) violated basic standards of journalistic work and neutrality by failing to point this out. Should someone be watching Retraction Watch?
Science isn’t two-hand-touch. You shouldn’t be able to get someone a penalty in the yard just because you don’t like which team they root for in the Major League. People are going to develop expertise in areas they have opinions about, and opinions about areas in which they have expertise. Because science is full of interpretation by people who are social and political animals, these opinions will have social and political dimensions. In hyperpolarized discourses where dichotomization rules the day, sometimes these opinions will conform to consensus, and sometimes they will dissent. When they dissent, they likely draw greater scrutiny than when they conform. This would not be unusual in science among other walks of life. But the way we treat dissent, as philosophers from J.S. Mill to Paul Feyerabend have observed, has far-reaching implications for critical thinking, and thus for science and society. Subjecting dissenting research to heightened scrutiny and penalty on the basis of researchers’ political associations is inconsistent with the values of freedom of thought and equality of opportunity on which science and other rational discourse in liberal democratic societies are based.
There’s also a regress problem with the asymmetry here: If the scientific evidence really does suggest that abortion may substantially harm women, then why shouldn’t experts on it also be members of pro-life organizations without necessarily disclosing that as a conflict of interest when they engage in scientific publishing activities? This goes back to the issue of dissent, because analogous affiliation with pro-choice organizations is being treated differently than affiliation with pro-life organizations — because it comports with the current, erroneous consensus and is within the realm of normal in every sense. It’s not a standard disclosure in abortion research, and there seem to be no related retractions of record (specifics below).
As a case of bias in bias science, the issues this case raises have broad implications for disclosure and methodology across domains. “Transparency (openness) and neutrality (balance, fairness),” in the words of Sander Greenland, “seem widely valued” — but researchers can “have very different concepts of what the values entail in practice” (“Transparency and disclosure, neutrality and balance: shared values or just shared words?” Sander Greenland, 2012, J Epidemiol Community Health; p. 967; full text). Since this case centers on algorithmic processes, it also has implications for AI ethics. And since I just posted about another case of ethics problems in ethics research, and am situating this one back in the bias-in-bias context, I might as well point out that this is that pot-kettle phenomenon in the explicitly moral domain: Ethics researchers behaving unethically. That it happens all the time is not a special indictment of ethics researchers (although it may be an indictment of science as a human enterprise, and humanity as prone to stupidity and evil). But it is a flag that some science reform efforts may disproportionately target dissenting research.
So while the point of the previous bias in bias science posts holds — no one is perfect; there is no “no-bias” bias (or other) research — we turn now to a wrinkle in the meta-science admonition that we are puny mortals doomed to fail due to pervasive cognitive-emotional limitations: There is a difference between an honest mistake and a dishonest one. Bias in bias science isn’t just about our pesky inability to attain perfect neutrality as a matter of perception (i.e., stupidity). It’s also about morally imperfect actors responding to perverse incentives to exhibit kinds of bias that are socially and politically encouraged. And those incentives likely drive unfairness in post-publication review that hurts science.
Case Study: McIntosh and Vitale 2023
On December 11, 2023, Leslie D. McIntosh and Cynthia Hudson Vitale published an article entitled “Safeguarding scientific integrity: A case study in examining manipulation in the peer review process” in Accountability in Research. The article says it selected its case study, a special issue:
due to irregularities detected by algorithms that analyze trust markers in published research. These algorithmic anomalies alerted us to one article (Coleman 2022) to manually review. In short, the study purpose (i.e., hypothesis) did not have the language typical for research in this area, and the lack of funding statement garnered suspicion.
In an email to Vitale, conversation with McIntosh, and follow-up message exchanges with McIntosh, I repeatedly asked for more information about these algorithms and trust markers. How do they work? How reliable are the results? Is modal hypothesis language proxying for trustworthiness?
McIntosh (MPH/PhD) responded for both authors. McIntosh is VP of Research Integrity at Digital Science, which acquired her start-up Ripeta, a company that developed and sold algorithms intended to “[detect] trust markers of research manuscripts.”
McIntosh repeatedly contradicted herself and the published account of case selection. Ultimately, McIntosh didn’t answer my questions and cut off correspondence.
Case Selection
In response to repeated questions about why or how the algorithms flagged Coleman 2022 as described in McIntosh & Vitale 2023, McIntosh indicated that the article was not purely or even mostly flagged by algorithmic processes. In a Dec. 15 conversation, she said she didn’t know how she picked the Coleman article to review — that “we do a lot of manual curation” in general, and that in this case she was manually exploring, curating, and “very little relied on the algorithm.” She went on to say the tool focuses on the structure (syntax) of hypotheses rather than their content. When asked whether the tool might disproportionately flag opposing views, and specifically whether she had done any research on this kind of concern, she did not provide a substantive response.
In a Dec. 15 follow-up message, McIntosh said that her response on specifics of hypothesis syntax had been irrelevant to this case study selection, and asked that I:
please note that the parallel to finding this article is similar to a case from last year https://retractionwatch.com/2022/08/24/how-a-tweet-sparked-an-investigation-that-led-to-a-phd-student-leaving-his-program/. When something hits the news (used to be Twitter) and I have the time to investigate, then I do. I’m always testing to see what we can and cannot attest to with Trust Markers. Also note, what I looked at was an entire special issue, not just the Coleman article. I would appreciate that nuance being mentioned. As long as that quote is used in context, I am okay with it.
In a Dec. 22 follow-up message, I again asked for more information on this question:
could you please tell me how you selected this case study, what irregularities the algorithms detected at what point in your manual curation process, how the hypothesis language had atypical language for research in this area, and what proportion of studies in this area lack a funding statement?
I noted:
there appears to be a discrepancy between what the article says about how this case study was identified, and what your message says. Did you select the Coleman article as a result of news coverage about its retraction, as your message implies? Or did it come to your attention instead as a result of trust marker algorithms? I would appreciate any further clarification you might be able to provide.
In a Dec. 22 reply, McIntosh again offered no substantive answers to case selection questions. She wrote:
I’m not sure why these clarifications are important for your article. Actually, I don’t know what your angle is.
Many of your questions appear to be answerable with some more research on how algorithm development works and some of the nuances. Specifically look at development from syntax rather than semantic analyses.
The last question about how I selected the article - it was not due to the Coleman article and that article had not been retracted at the time of exploring the special issue.
On January 10, 2024, I messaged again, repeating my questions about case study selection as well as questions about the tool’s accuracy, reliability, and validation that McIntosh had not answered in conversation or messages:
Did you select the Coleman article as a result of news coverage about its retraction, or did it come to your attention instead as a result of trust marker algorithmic processes? If your article’s description of the Coleman article’s selection as a case study is inaccurate, do you plan to issue a retraction or correction? If, on the other hand, your article’s description of the case study selection process is accurate, then what irregularities did the algorithms detect, when in your manual curation process, how did the hypothesis have atypical language for research in this area, and what proportion of studies in this area lack a funding statement? I’m also still wondering whether modal hypothesis language proxied for trustworthiness, and what if any checks there are to ensure that the algorithm does not disproportionately flag dissent.
Did you hear back from your engineers regarding the tool’s accuracy, reliability, and methods employed for its validation?
On Jan. 11, McIntosh replied:
After reflecting on our conversation and how I prioritize my time, I am going to decline further discussion. The information on the algorthms [sic] that is public is in the white paper. I responded to your questions about the paper. I would encourage you to re-read the methodology on why we looked at the issues of conflicts of interest.
There are multiple, conflicting case selection narratives offered by McIntosh & Vitale 2023, McIntosh in conversation, and McIntosh’s messages. Prominent abortion researchers like retired Bowling Green State University Professor of Human Development and Family Studies Priscilla Coleman, whose work threatens the pro-choice consensus narrative, are easy targets. To single Coleman out for criticism on the basis of dishonestly reported use of nontransparent algorithmic processes with unestablished accuracy, reliability, and validation is not science. It is high-tech bullying.
It also illustrates two common misuses of AI and other technologies: (1) in mass screenings for low-prevalence problems (MaSLoPP) without assessment of their dangers, and (2) in confirmatory evidence-seeking without recognition of its flaws. McIntosh’s apparent business model illustrates the first misuse. McIntosh’s self-reported case selection of Coleman 2022 illustrates the second. (It’s worth noting that both mass and targeted screenings have appropriate uses, moored in recognizing the implications of probability theory and applying scientific evidentiary standards.)
The two (mis)use categories cannot both apply to the same case. Either McIntosh & Vitale 2023’s published case selection description is accurate, and this is a case of mass screening of published articles for a low-prevalence problem of poor trust marker indicators, albeit using a tech with undisclosed accuracy that appears to lack validation. Or the article’s case selection description is inaccurate, as per McIntosh’s other statements, and this is instead a case of confirmatory evidence-seeking that promotes the tech and services from which the authors make money. It does this by adding criticism to an already embattled author and article (Coleman 2022) that expressed an unpopular dissenting view. Either way, this is bad science and bad ethics. Here’s why.
Category 1 - Mass Screenings, Massive Dangers
Last month, I gave a Chaos Computer Club talk building on several previous posts about the dangers of mass screenings for low-prevalence problems (“Chat Control: Mass Screenings, Massive Dangers,” FireShonks, Dec. 27, 2023). It reiterated my usual rant that mass screenings for low-prevalence problems are often doomed to backfire according to the implications of probability theory. This is a dangerous structure under common conditions of rarity, uncertainty, and secondary screening harms. The rarer the thing of interest, the more likely the common (false positives) are to overwhelm the rare (true positives) in mass screenings; under conditions of persistent inferential uncertainty (not knowing for sure which are true and false positives and negatives), we can’t validate tests, and so don’t know how accurate they are; and secondary screening harms can, on net, outweigh the benefits of catching true positives, both directly (when mass and secondary screenings harm the very people these programs intend to protect) and indirectly (when mass and secondary disambiguating screenings consume finite resources needed to respond to specific, original concerns about the problems of interest).
This is all true even for “objective” problems, like cancer. It’s true even for the worst imaginable crimes, like child sexual abuse. It’s true across diverse contexts, no matter how accurate or shiny the technology, because it stems from the structure of the world. Usually, there’s not a one-to-one correspondence between what we want to identify, and the cue(s) we use to try to identify it. There are probabilistic cues instead. This invokes the accuracy-error trade-off in statistics; we will always make mistakes in categorizing the world (this is not our fault), and so we will always be trapped in a trade-off between different types of mistakes (so-called type 1 and type 2 errors: false positives and false negatives, respectively). This is an unchangeable implication of universal mathematical laws that techno-solutionist narratives deny at society’s peril.
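To make the arithmetic concrete, here is a minimal sketch in Python using made-up illustrative numbers (not figures from any real screening program) of how a low base rate lets false positives swamp true positives even when a test looks accurate on paper:

```python
def screening_outcomes(population, prevalence, sensitivity, specificity):
    """Expected counts from one round of mass screening."""
    affected = population * prevalence
    unaffected = population - affected
    true_positives = affected * sensitivity
    false_positives = unaffected * (1 - specificity)
    ppv = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, ppv

# Hypothetical numbers for illustration only: screen 1,000,000 papers for a
# problem affecting 1 in 1,000 of them, with a tool that is 95% sensitive
# and 95% specific.
tp, fp, ppv = screening_outcomes(1_000_000, 0.001, 0.95, 0.95)
print(f"true positives:  {tp:,.0f}")            # 950
print(f"false positives: {fp:,.0f}")            # 49,950
print(f"positive predictive value: {ppv:.1%}")  # ~1.9%
```

Under these assumed numbers, roughly 98 of every 100 flags would be false alarms, and every downward revision of the prevalence makes that ratio worse.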
I’ve written previously about the potential dangers of mass screenings for low-prevalence problems in security and medical contexts, but what about McIntosh and Vitale’s sector — education? The mathematical structure of the programs and thus the associated probability theory problems are identical across domains. That’s why, for example, Vanderbilt University announced on August 16, 2023 that it “decided to disable Turnitin’s AI detection tool for the foreseeable future.” Turnitin is a U.S. company that sells software licenses to universities and high schools to screen student assignments for plagiarism and AI use in writing. It sounds like Vanderbilt still uses the software for plagiarism screening, just not its AI detection tool. That’s at least a partial win for logic and fairness; for AI use in writing, there’s research suggesting AI designed to catch it is biased against non-native English speakers (due to smaller vocabulary size).
What does all this have to do with McIntosh and Vitale? On Dec. 13, 2023, in response to my questions about their article, McIntosh sent me “Digital Science White Paper: Introducing Dimensions Research Integrity,” Powered by Ripeta — Leslie D. McIntosh, Ruth Whittam, Simon Porter, Cynthia Hudson-Vitale, and Misha Kidambi, Feb. 2023. This paper defines the signals the algorithms at issue look for, “trust markers,” as “the explicit statements on a paper such as funding, data availability, conflict of interest, author contributions, and ethical approval” (p. 1), spinning them as “a new type of article metadata representing the integrity and reproducibility of scientific research” (p. 4).
This tech addresses a signal detection problem — the most abstract description of any problem in which we are trying to figure out whether a signal is present in some noise or not. And, as is typical of this type of problem, these are inherently imperfect, probabilistic cues. In this case, they are also implausible, unvalidated cues proxying for poorly defined outcomes (signals). Integrity is hard to define, and reproducibility should theoretically rely for validation on actual reproduction — which remains rare, and which still lacks a consensus definition (National Academies citing Barba’s 2018 review).
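To illustrate why such surface cues are weak, here is a deliberately naive, hypothetical sketch in Python of what syntax-level detection of one “trust marker” (a funding statement) might look like. This is my illustration, not the Ripeta/Dimensions algorithm, whose internals are not public:

```python
import re

# Hypothetical illustration only -- not the Ripeta/Dimensions algorithm,
# whose internals are not public. It checks whether stock funding-statement
# phrasing appears anywhere in the text.
FUNDING_PATTERNS = [
    r"\bthis (work|research|study) was (not )?(supported|funded) by\b",
    r"\bno external funding\b",
    r"\bfunding statement\b",
]

def has_funding_statement(text: str) -> bool:
    """Return True if any boilerplate funding phrasing is present."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in FUNDING_PATTERNS)

print(has_funding_statement("This work was not supported by any funding."))  # True
print(has_funding_statement("We thank the foundation that paid for this."))  # False
```

A paper can recite the expected boilerplate and still be fraudulent, or omit it and be perfectly sound; the presence of wording is, at best, a probabilistic cue for integrity.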
This bad proxying logic is pervasive in contemporary science and tech. Its alternative is mooring to causation. In order to be able to learn if we are right or wrong in matters of classification, we have to model causal generative processes first, as the upper-echelon consensus of methods pioneers (especially Judea Pearl and Sander Greenland), philosophers of science (like Nancy Cartwright), and other leading science reformers (like Richard McElreath) emphasize. To paraphrase Greenland, we need more science in our science.
So McIntosh et al’s “trust markers” — a euphemism for unproven, wishful proxies for sale — exemplify pseudoscientific business as usual. The authors of the white paper, including McIntosh and Vitale, stand to profit from selling the tech and related services the paper promotes. At least one of the intended use cases is mass screenings for low-prevalence problems — checking research for trust markers and intervening “before publication, if needed” (Figure 2, p. 3) at the institutional level.
Institutions have deep pockets, so of course tech snake-oil salesmen target them with crap like this. And that pitch makes the most sense in terms of prevention. No institution wants to pay to find out, after the fact, more and better information about its people publishing bad research. From journal editors and publishers to research institutes and co-authors, people usually work pretty hard to avoid that. So the incentives are structured such that people hawking educational AI like “trust markers” for “research integrity” will probably tend to pitch preventive screenings when they are trying to make sales.
This is problematic. As often with these sorts of screenings, McIntosh et al don’t define exactly what problems they’re screening for, how many of them there are, or how common they are. We would need to know those things to calculate hypothetical outcome estimates of the sort people need to see to get a sense of whether these programs offer net benefits or harms. McIntosh et al also don’t say whether or how they have checked the repeatedly referenced 33 million studied publications for classification accuracy, what they know and do about false negatives and false positives, or how they’ve attempted to validate their results (which they claim to have done across 10 different fields, p. 6). This is not sui generis in common misuses of algorithmic processes. It is still alarming.
It is not enough to say that you validated your results. Science requires transparency in reporting how. Particularly in cases such as this, where it is not clear that the results were actually validated at all; and it’s not clear that they can be…
The validation problem in theory
It’s not clear how to validate proxies for doing ethical science. Validation is a common problem with mass screenings for low-prevalence problems. It stems from not knowing what we need to know, and not being able to find it out. This uncertainty has three parts.
Rarity
We would like to know the true base rate of things like misdeeds, but can’t generally learn it in the same way we can figure out the true base rate of things like clinically significant diseases by triangulating data on diagnosed cases and related deaths in countries with and without such screenings, and undiagnosed cases on autopsy. Dishonest research problems fall into the former category, where we wouldn’t expect to be able to learn the true base rate of research fraud or corruption according to various definitions by asking people, because people have incentives to lie. So we don’t know how many cases of how many subgroups there are to classify in the first place.
Ground truth
The second part of uncertainty giving rise to the persistent validation problem here is about whether the test result (algorithmic classification) is right. When we don’t know how to verify whether someone whom a test flags as guilty really is or not, as in the case of lie detection, then we don’t know how accurate the test is in the real world, and we can’t find it out under real-world conditions that are like conditions under which it really matters. In abstract terms, there is then persistent inferential uncertainty about what purported proxies mean — about the relationship between the probabilistic cues from which we would like to infer a particular categorization, and the actual signal of interest. Lab studies can’t resolve this uncertainty due to their high artificiality in comparison with real-world contexts, and researchers conducting lab studies tend to have perverse incentives to inflate their accuracy rates.
This uncertainty makes it difficult if not impossible to sort the false from true positives and false from true negatives given screening results — a big sorting job when the screening involves entire populations. Sometimes this is not a problem, as in the case of screening all pregnant women for HIV even in countries with low prevalence. Secondary screening can sort out which positives are false, so the inferential uncertainty does not persist. By contrast, in contexts where this uncertainty does persist, as in many criminal and cancer screenings, the secondary screenings and treatments used to try to obliterate this uncertainty and mitigate risks can do a lot of harm — incurring net damages. This points to the third part of problematic uncertainty with respect to validating these screenings…
Net effects given secondary screenings
In order to know if mass screenings make sense, we need to know the net risk-benefit effect of secondary screenings to disambiguate correct and incorrect classifications. These secondary screenings often wreak havoc on society without being offset by measurable net benefits. Mammography screening for early breast cancer detection is one particularly well-researched example. There’s not a proven all-cause mortality benefit to offset the costs, including a roughly 10% false positive incidence associated with risky follow-up diagnostics like biopsy and partial/complete breast removal of uncertain benefit. So we aren’t sure what the net mortality picture really is, and different well-informed people can reach different conclusions about what’s rational for them on the basis of the best available evidence.
Overall, validating mass screenings for low-prevalence problems is hard when there’s uncertainty about rarity (especially when people have incentives to lie about the thing we’re interested in), uncertainty about how well we’re actually classifying the world based on imperfect cues, and uncertainty about the net benefit-harm balance of secondary screenings used to sort correct from incorrect classifications. What does that mean for research integrity AI?
The validation problem in practice
We should be suspicious of tech like this, because it is difficult to see how it could work. There are indications in McIntosh et al that it doesn’t. These markers don’t look like very good proxies for research integrity after all if you glance at Figure 4 and know which of these publishers is considered possibly predatory (p. 8), like Frontiers, Hindawi, and MDPI. There are also recent complaints of predatory behavior from Wiley and mass editor resignations from several Springer Nature journals. But even without digging into this, anyone who knows the quality of publications in these journals should be able to skim the list and see that rank-ordering Frontiers, Hindawi, and MDPI (possibly predatory publishers) ahead of BMJ, Springer, and Oxford University Press publications on quality lacks face plausibility. (Most scientific publishing is predatory and should be burned down to the ground, but that’s a different story.)
Why does this matter? Well, it’s unclear what “trust markers” proxy for. But it seems unlikely to be research quality, which is ultimately why we care about reproducibility.
The reproducibility crisis affords profit opportunities to AI hucksters who bring out algorithmic processes to “do something” about the problem, but that something appears to be neither ethically nor scientifically rigorous. This should not be particularly surprising to any student of human character or science history. Of course where there’s money to be made, there are people doing shady stuff to make it.
As a matter of mathematics, using algorithmic processes to conduct mass screenings for low-prevalence problems for profit is one thing. Using them to assess a particular case that has already been flagged in the media is something else entirely…
Category 2 - Targeted (aka High-Risk, Selective, or Symptomatic) Screening
On one hand, it makes sense to test, and to give positive test results more weight as likely true, when they comport with your suspicions based on background information. For instance, a positive HIV test result in an IV drug user is more likely to be a true positive than one in a pregnant housewife who recently had Covid (a possible false positive risk factor).
On the other hand, a problem can arise when experts think results are right because they confirm their prior beliefs. This can open the door to common cognitive distortions including confirmation bias, bandwagoning, and cognitive closure.
So what distinguishes appropriate uses of screening tools in targeted screening contexts from inappropriate ones? Or, why isn’t it best to use all the tools and all the information at your disposal, all the time?
Just like with mass screenings for low-prevalence problems, the calculus here depends on what we know (or don’t know) about the problem’s rarity, persistent inferential uncertainty (how good the probabilistic cue is and how well we can validate that association IRL), and secondary screening harms (accruing to the net benefit-harm balance). The validation problem is at the heart of all this, along with common misunderstandings of what it means to update assessments based on multiple sources of information. The causal revolution offers the appropriate toolset to begin to address that problem. This is about following logical and statistical chains of evidence.
If a classification aid is validated, then placing more weight on positive results in a high-risk subgroup is good Bayesian updating. If, on the other hand, it’s an unvalidated product for sale, then placing more weight on “positive results” that consist mostly of manual curation driven by political opposition to dissenting science and news coverage thereof — as in the case of McIntosh & Vitale’s dunk on Coleman — lacks a valid scientific evidentiary basis.
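A minimal sketch in Python, reusing the same hypothetical 95%-sensitive, 95%-specific test from above, shows why the prior does the heavy lifting in this kind of updating, and why the update is only as good as the validation behind those accuracy figures:

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """Bayes' rule: P(condition is present | positive test result)."""
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

# Illustrative priors only: a low-prevalence mass screening versus a
# targeted screening of a high-risk subgroup, using the same hypothetical test.
for label, prior in [("mass screening, low prevalence", 0.001),
                     ("targeted screening, high-risk group", 0.20)]:
    posterior = posterior_given_positive(prior, 0.95, 0.95)
    print(f"{label}: prior {prior:.1%} -> posterior {posterior:.1%}")
# mass screening, low prevalence: prior 0.1% -> posterior ~1.9%
# targeted screening, high-risk group: prior 20.0% -> posterior ~82.6%
```

The same positive flag means something very different depending on the prior; and if the accuracy figures are unknown and the “prior” is manufactured by political targeting rather than evidence, the arithmetic is decoration.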
While it doesn’t make sense as a matter of science, it does make sense as a matter of business: It promotes the algorithmic process that ostensibly flags the unpopular science by appearing to validate it. It advertises the tech’s purveyors as being on the winning socio-political team. It is confirmatory evidence-seeking that produces pseudoscience with the “right” results according to powerful social and political networks. In this sense, bias in bias science isn't just about the inescapability of perspective, but also about the perceived value of work with the right kinds of bias.
Scientific validation works the other way. Tests don’t gain validity when they confirm what we think we already know. They might garner more social and political support when they do; they might make more money for their purveyors. But validation means we have a chance to learn if we are wrong in suspecting a particular case, and if the tool we’re using to classify it is wrong, too. There doesn’t appear to be any evidence that McIntosh & Vitale gave Coleman (or anyone else) that chance. McIntosh said she had made no attempt to discuss any related issues with any relevant authors, editors, or reviewers.
This is not sui generis. As I wrote previously, many researchers and other professionals use screenings for low-prevalence problems “without attention to… false positive problems.” See, e.g., UT Austin Assistant Professor of Sociology Sarah Brayne’s anecdote from her LAPD Gotham fieldwork:
In Brayne’s fieldwork, Palantir’s technology came in for special praise… But up close, the software was only as good as the people maintaining and using it. To make sense of Palantir Gotham’s data, police often need input from engineers, some of whom are provided by Palantir. At one point in her research, Brayne watched a Palantir engineer search 140 million records for a hypothetical man of average build driving a black four-door sedan. The engineer narrowed those results to 2 million records, then to 160,000, and finally to 13 people, before checking which of those people had arrests on their records. At various points in the search, he made assumptions that could easily throw off the result — that the car was likely made between 2002 and 2005, that the man was heavy-set. Brayne asked what happened if the system served up a false positive. ‘I don’t know,’ the engineer replied. (“How the LAPD and Palantir Use Data to Justify Racist Policing,” Mara Hvistendahl, The Intercept, Jan. 30, 2021).
There is widespread under-recognition of the persistent problem of imperfection, and the accuracy-error trade-off it invokes, in increasingly common uses of algorithmic processes to make important classifications in a wide range of domains.
Double Standards
There are serious concerns about McIntosh’s inconsistency and nontransparency in reporting the case selection procedure, the implications of probability theory for McIntosh and Vitale’s algorithmic processes, and their apparent failures to distinguish different use cases (mass versus targeted screenings), their dangers, and targeted screening versus confirmatory evidence-seeking. In addition to these concerns, it’s also worth noting the double standards in play here in terms of transparency of disclosures, neutrality requirements, post-publication review, and reproducibility.
Transparency of disclosures
McIntosh et al emphasize “a statement of how the research is funded” as a hallmark of research integrity (white paper p. 2). And they flagged Coleman 2022 in part for reporting no funding for the article. But their own article flagging hers also reported no funding. This statement is arguably more suspect for McIntosh than for Coleman, because McIntosh (as the article discloses) makes a living using the software the article reports using.
McIntosh & Vitale’s disclosure statement reads: “Leslie McIntosh is a full-time employee and Vice President of Research Integrity at Digital Science. Cynthia Hudson Vitale receives a part-time salary from Digital Science. Both provide advice on products related to research integrity.” So if the authors worked on this article on company time, the funding statement “This work was not supported by any funding” would seem to be false. If they did not, they still have a monetary interest in promoting the algorithmic processes the article describes using. By contrast, Coleman is a retired professor; it’s not clear she has or had a current income stream that comes from related work.
There is symmetry in the possibility of conflict of interest here. There’s no full accounting of income disclosing possible relevant consulting clients from either McIntosh or Coleman. Both could make money from related work.
But there is asymmetry in the known conflict of interest: We know McIntosh makes money from related work, because her disclosure statement says so. It leaves open “the questions of current funding and what clients and topics are served in private practice” — specifics that might “reveal potential conflicts of interest” (Greenland p. 969).
The larger point is that, as Greenland argues, transparency of disclosures may be “easier said than done” (“Transparency and disclosure…” p. 969). Just as McIntosh and Vitale note that they “may have missed associations [that could signify conflict of interest] as there is no registry of organizations affiliations” (McIntosh & Vitale p. 13), so too there is no “registration of consults, so that disclosures could cite complete listings at a central registration website,” meaningful enforcement mechanisms for such a registry don’t exist, and extensive disclosure risks burying key conflicts of interest in details, anyway (Greenland, p. 969).
Neutrality requirements (balance, fairness)
McIntosh & Vitale assume that association with political organizations should be identified as a potential conflict of interest: “We identified multiple organizations associated with one or more of the editors, reviewers, or authors that are political and should be identified as a potential conflict of interest” (p. 9). They cite numerous pro-life affiliations of guest editors, reviewers, and authors who contributed to the special issue in which Coleman 2022 appeared (Table 3). They go even further, noting one guest editor’s affiliation with Catholic organizations before saying that Catholic people and organizations “may have personal biases around abortion,” but “may not be actively lobbying or advocating for anti-abortion policies. Hence, this is not considered a conflict of interest” (p. 9-10).
It is not clear that policy-level beliefs or relevant political associations should be considered conflicts of interest in science. It is alarming to see a scientist’s faith mentioned as a possible reason to consider his work suspect. Raising this as an issue before dismissing it is enough to bring inappropriate consideration into scientific discourse where it does not belong. If this were a tenure meeting, it would be grounds for a lawsuit.
There are three reasons we might not want scientific publishing to be structured like this.
Constraining liberties and perspectives
One could argue that scientists should disclose relevant policy beliefs, political associations, and even personal experiences in the interests of transparency, in order to let readers evaluate for themselves how these dynamics might compromise neutrality by shaping evidence observation, analysis, and interpretation. However, it’s not clear where such disclosures should begin or end. Should a researcher studying sexual violence be required to disclose if she has experienced it?
Privacy is a liberal democratic value that extends to freedom of thought, expression, and association, and so requiring personal disclosures including about associations may create value conflicts between neutrality/fairness and liberty/privacy. Expansive disclosure requirements could also enable marginalization of (substantively) dissenting and (structurally) nontraditional perspectives…
Reproducing inequalities and errors
Similarly, one could argue that scientists should seek out peer reviewers who may disagree with their positions in order to maximize the potential for constructive criticism. That’s one idea of how peer review is supposed to work.
But you also generally don’t want to hone your arguments with people who disagree with you so strongly that they won’t entertain them. There’s limited intellectual value in ideological deadlock. One might expect such limitation to apply to hyperpolarized discourses like abortion, Covid, and infant feeding.
When the consensus is wrong and you intend to burn it down to the ground and start over, you don’t expect to walk in the front door, get a bunch of experts to help you put them out of work, and walk away whistling.
Masking bias (cloaking asymmetry)
Bracketing both of those possible problems with McIntosh & Vitale’s interpretation of conflict of interest reporting requirements, there’s a symmetry issue here: If pro-life affiliations count as conflicts of interest that require reporting as such, then pro-choice affiliations must, too. These are common affiliations among abortion researchers in general. In this way, neutrality concerns can reflect power masquerading as a lack of perspective that human beings cannot actually attain (“I’m not biased; I’m with the status quo”).
Here, the lines between transparency/disclosure and neutrality/balance blur through interpretive lenses that can be political. Because in order to answer the question “Transparency/disclosure of what?” we have to define conflict of interest. If you don’t think it’s required because you’re already neutral, that’s more likely to be a perspective associated with a consensus position — with power — than it is to be one associated with difference, dissent, or destruction of the current order.
This plays out in the current abortion science discourse, in which the consensus pro-choice perspective is miscoded as neutral, while the dissenting perspective is coded as a conflict of interest, reflecting a double standard. This is not sui generis, but rather illustrates the larger problem of how power can masquerade as neutrality, compounding asymmetries. Beyond critiquing the operationalization of neutrality itself as necessarily political, this can also be seen as one example of many ways in which rules are often applied to the vulnerable and the different more stringently than they are to the powerful and conforming (“stick out, get hammered”).
Here, Coleman’s retracted 2022 article appeared in a Frontiers in Behavioral Neuroscience special issue on “Fertility, Pregnancy and Mental Health - a Behavioral and Biomedical Perspective.” McIntosh & Vitale 2023 treat the entire special issue as a case study, though, as noted, the article says they selected the case specifically when their AI flagged Coleman’s article (a description which seems to be false according to McIntosh). About the special issue, they also claim:
Two articles were found to have undisclosed conflicts of interest between authors, an editor, and multiple peer reviewers affiliated with anti-abortion advocacy and lobbying groups, indicating compromised objectivity. This lack of transparency undermines the peer review process and enables biased research and disinformation proliferation (p. 1).
Compare these findings with analogous possible conflicts of interest and affiliations in the case of abortion proponent researchers involved in producing these Frontiers articles:
“Media advocacy in catalyzing actions by decision-makers: case study of the advance family planning initiative in Kenya,” by Choge et al, Frontiers in Global Women's Health (Vol. 4, June 2023). This article was edited by Tamara Fetters, a Senior Researcher at Ipas. The Ipas website says “The Ipas Impact Network works globally to advance reproductive justice by expanding access to abortion and contraception.” The article was reviewed by Beatriz Galli and Noreen Fagan, both affiliated with Ipas. The authors list affiliation with the Advance Family Planning Project and International Center for Reproductive Health, both in Kenya.
“Abortion as an Essential Health Service in Latin America during the COVID-19 Pandemic,” by Michel et al, Frontiers in Global Women's Health (Vol. 3, Aug. 2022). This article was also edited by Fetters. It was reviewed by Yan Che, who according to a 2021 publication has been affiliated with the NHC Key Lab. of Reproduction Regulation, Shanghai Institute of Planned Parenthood Research. According to the Global Early Adolescent Study, SIPPR “was established in 1978 and affiliated to Shanghai Academy of Science & Technology and is a social welfare research institution in the field of family planning and reproductive health.” The article was also reviewed by Deborah L. Billings, who has worked at Ipas. The authors list affiliation with the Latin American Consortium Against Unsafe Abortion, which Human Rights Connected describes as “a joint [sic] made up of activists, researchers, health service providers and professionals that contributes to the reduction of unsafe abortion in Latin America. It promotes access to information and modern and safe technologies within the framework of full respect for sexual and reproductive rights, from a gender and equity perspective.” (The consortium’s own website seems to not be working.)
“Saving more lives on time: Strategic policy implementation and financial inclusion for safe abortion in Indonesia during COVID-19 and beyond,” by Putri Widi Saraswati, Frontiers in Global Women's Health (Vol. 3, Sept. 2022). This article was also edited by Fetters. It was reviewed by Rasha Dabash and Deborah Billings, who have both worked at Ipas. The title and article make particularly strong and unproven assumptions about abortion net saving lives, reflecting a particular policy position.
There are many more such examples. The point is that, in reality, abortion researchers (like other researchers) often publish science within networks that share their policy positions or other foundational assumptions. They possibly couldn’t publish it any other way due to the hyperpolarized nature of the discourse and the short supply of quality peer reviewers. And it’s only the pro-life network that gets hammered for this. As if it’s only a conflict of interest when it pertains to a dissenting position. It’s only enabling “biased research and disinformation proliferation” (McIntosh & Vitale) when it expresses a view or has social and political implications that are at odds with the consensus. (Tangentially, “disinformation” is usually used to refer to information a hostile foreign power puts out in a competing information environment, e.g., Russian narratives about Ukraine distributed in the West via social media bots; M&V mean “misinformation,” an inherently political designation often used to discredit one’s opponents under conditions of epistemic uncertainty.)
Post-publication review (retractions)
Similarly, there are double standards in post-publication review when it comes to abortion science. An example illustrates the problem.
In March 2023, I had a few exchanges with Retraction Watch staff about having found that the authors of a recent article on abortion access and suicide risk had engaged in p-hacking and reported demonstrably false findings. The article was Zandberg et al’s Dec. 2022 JAMA Psychiatry “Association Between State-Level Access to Reproductive Care and Suicide Rates Among Women of Reproductive Age in the United States.” This post described the evidence of p-hacking and why the published findings are false, and called on the authors to retract.
P-hacking alone is grounds for retraction, and the evidence of it here was overt and incontrovertible. An email exchange with the journal editor proved predictably fruitless. I contacted a number of news outlets that had uncritically reported the false findings, and only one responded; an editor I’ve worked with said he would pass my email on to his science editor, but nothing came of it.
Over email, I asked Retraction Watch to cover the case. Editor in Chief Ivan Oransky responded “… we really can't provide advice or suggestions in individual cases. That would be like a business reporter giving stock tips” (Mar. 25, 2023). It was, however, news enough for Retraction Watch to cover it when pro-choice Twitter commentators attacked Coleman and her colleagues who worked on the special issue that published her 2022 article.
At the time, I concluded that perverse incentives drive journal editors to put cynicism over substance — publishing bad science and then defending it — because retractions can cause headaches (e.g., lawsuits, reputational costs) that they can avoid by just silencing critics.
The interesting question, I think, is not what happens next. Authors who cheated have disincentives to admit it. Editors who published science fiction have disincentives to correct the record. Reputations are at stake. Retraction looks like an admission of guilt.
So it’s not generally worth one’s time or energy to try to get single instances of false findings retracted. This suggests that retraction may be the policy in cases like false findings, but it’s not the social and political reality. There are, rather, complex dynamics influencing what comments get posted, what letters get printed post-publication, who can engage on auxiliary platforms like PubPeer, and what retraction requests usually get made publicly, much less enacted. And those dynamics also affect science reform projects like Retraction Watch, not because it’s particularly a bastion of bias and corruption, but because it’s run by human beings.
In a less-imperfect world, science reformers would have some way of assessing post-publication review bias. In reality, though, there’s no meta-data on the social and political determinants of attempted versus successful retractions. It would be really hard to study bias in post-publication review.
It would probably take a substantial public outcry to get McIntosh & Vitale 2023 retracted, the way it took a substantial public outcry to get Coleman 2022 retracted. Powerful social and political networks had the manpower to mobilize for the latter, while they don’t have an interest in doing that for boring old anti-corruption work — of which there is anyway too much to do, to do it all. This is a garden-variety special interest versus public good problem. But it also illustrates that science reform mechanisms risk being coopted by the powerful to punish dissent, enforce conformity to preferred establishment narratives, and promote profit — ironically in the name of research integrity — at the expense of research integrity itself.
Reproducibility
McIntosh et al identify elements of transparency and reproducibility as research integrity trust markers (Table 1, p. 5). The identified reproducibility elements are: repositories, data locations, data availability statement, code availability statement, and analysis software. But the software McIntosh & Vitale used is in development. So theirs is non-transparency transparency software used to produce irreproducible reproducibility research.
So What?
Just as bias pervades bias science because science is done by people, so too do methodological and ethical problems affect research on research integrity that is supposed to be about rigor and righteousness. But, as always, you have to read beyond the abstract to see these kinds of problems. Otherwise, busy people might well read Retraction Watch’s associations with McIntosh as an endorsement of her research and related professional services: research in which nontransparent algorithmic processes, from which the researchers stand to profit, appear to have been used post hoc both to smear scientists as unethical without actually evaluating the merits of their work and to promote those same processes as objective, neutral, and transparent, when the evidence is insufficient to establish them as any of those things. McIntosh’s Digital Science bio boasts “Dr. McIntosh’s work was the most-read RetractionWatch post of 2022.”
It’s not a one-sided relationship; both McIntosh and Retraction Watch seem to use each other to bolster their credibility. There is no achieving perfect neutrality, but algorithmic processes like those McIntosh et al promote can offer organizations like Retraction Watch a veneer of neutral justification for making discretionary administrative decisions in running bureaucracies, interpretive choices in evaluating evidence, and editorial choices in deciding what suspicious research is news. Tech can give administrators and other experts a great justification slop — a mix of information leftovers that is unfit for human consumption.
This suggests bias in bias science is not only a problem of the inescapability of perspective. It’s likely worse the more polarized a discourse is, not just because cognitive biases and emotions distort reasoning, but also in response to financial, social, and political incentives to do shoddy or even fraudulent research. And those incentive structures include asymmetries in disincentives — as when only one side in a hyperpolarized discourse is subjected to heightened scrutiny, creating double standards for extreme outcomes like retraction.
On one hand, this is just a description of the structure of science as a social enterprise. Publications have professional and social values for authors, with these incentives notoriously contributing to individual behaviors that degrade the ecosystem as a whole (the classic citation contains Altman’s clarion call that “We need less research, better research, and research done for the right reasons”). So there’s a whole universe of corruption across diverse domains in inflated accuracy rates alone — rates in which researchers have financial and professional interests. People are just being people, everywhere.
On the other hand, when these incentives and their counterbalancing disincentives are skewed, favoring the preferred narrative of powerful social and political networks, then that magnifies distortion in the discourse, further diluting the science in science (to paraphrase Greenland) by making critical thinking unthinkable. The spiral of silence likely plays a role, too, as fear of social isolation for breaking with the “right” position keeps people with moderate and dissenting views from expressing them. This would tend to drive further polarization.
Bias in bias science that is potentially profitable and socially desirable may seem particularly ironic in the sub-corridor of research integrity research, which is ostensibly intended to help advance scientific norms such as honesty and pure motives. Or not, depending on your frame of reference. “Just as every cop is a criminal, and all the sinners saints,” the duality of good and evil is as inescapable a facet of being human as is perspective itself. People respond differently to having the potential to do wrong, and people who are attracted to that power for the wrong reasons are probably more likely to seek it out (recall Plato’s argument in The Republic that the best rulers are those least eager to rule).
The good news is that we don’t have to choose between better science and better ethics when it comes to reform. Better science implies better ethics. Being able to make real-world decisions on the basis of better information about what we really know (and don’t know) is the bedrock of hurting people less, if you’re into that sort of thing. We know how to do this. It may take more time and get fewer clicks. But society benefits when researchers slow down and do things right.
In the meantime, science reform may need to consider the age-old question, “Quis custodiet ipsos custodes?” (Juvenal) Who will guard the guardians?