Statistical significance testing misuse and spin hide possible harm from RSV vaccination in pregnancy
Two articles in the latest Obstetrics & Gynecology misuse statistical significance testing, downplaying possible harm from RSV vaccination in pregnancy.
Reporting the results of a global phase 3 randomized double-blind trial on Pfizer’s bivalent protein-based nonadjuvanted RSV vaccine, Madhi et al wrote:
Preterm birth rates were 5.7% in the RSVpreF arm and 4.7% in the placebo arm (relative risk [RR] 1.20, 95% CI, 0.98-1.46)… no clinically significant increase in adverse events of special interest, including preterm birth, low birth weight, or neonatal hospitalization, was observed… — “Preterm Birth Frequency and Associated Outcomes From the MATISSE (Maternal Immunization Study for Safety and Efficacy) Maternal Trial of the Bivalent Respiratory Syncytial Virus Prefusion F Protein Vaccine.”
Editorializing on this safety analysis as well as the accompanying efficacy and immunogenicity analysis by Simões et al, Dugdale et al wrote:
As was seen in the RSV MAT-009 study, PTB [preterm birth] rates with vaccine and placebo were identical in high-income countries but were increased among vaccine recipients in LMICs (RR 1.73, 95% CI, 1.22-2.47). This is consistent with a well-designed postlicensing observational study of the Pfizer vaccine in New York State, demonstrating no association with PTB — p. 2, “Respiratory Syncytial Virus Vaccination in Pregnancy: Safety, Efficacy, and Global Implications.”
This statement contains two mistakes. First, Dugdale et al’s Table 1 above the text shows that, in MATISSE, the preterm birth risk with vaccine may have increased regardless of country subgroup (95% CI .98-1.46). This compatibility interval overlaps with the statistically significant subgroup finding associating preterm birth risk with vaccination in low- and middle-income countries (95% CI 1.22-2.47).
Second, the cited observational study did not demonstrate no association between vaccination and preterm birth risk. Rather, it found:
During the study period, 60 patients who had evidence of prenatal vaccination (5.9%) experienced PTB vs 131 of those who did not (6.7%). Prenatal vaccination was not associated with an increased risk for PTB after adjusting for potential confounders (adjusted OR, 0.87; 95% CI, 0.62-1.20) and addressing immortal time bias (hazard ratio [HR], 0.93; 95% CI, 0.64-1.34). — “Nonadjuvanted Bivalent Respiratory Syncytial Virus Vaccination and Perinatal Outcomes,” Son et al, JAMA Netw Open, July 2024, 1;7(7):e2419268.
In other words, Son et al also misinterpreted their statistical significance test results. Full interpretation of their adjusted compatibility intervals shows possible increased risk of preterm birth associated with vaccination.
Overall, Madhi et al, Dugdale et al, and Son et al all reported results consistent with substantial possible preterm birth risk increases associated with maternal RSV vaccination in pregnancy; but they all misinterpreted their statistical significance test results, downplaying the possible risks. Their analyses actually suggested preterm birth rates may be 2% lower to 46% higher for babies of vaccinated moms (Madhi et al), 22-147% higher in low- and middle-income countries (Dugdale et al), and 36% lower to 34% higher accounting for immortal time bias (Son et al).
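The percentage ranges above follow from the reported ratio intervals: a ratio of r corresponds to a change of (r - 1) x 100 percent relative to no effect. Here is a minimal Python sketch of that conversion. The function name and dictionary labels are illustrative and do not come from the cited papers, and reading a hazard ratio as a percent change in rates is an approximation.

```python
def ratio_ci_to_percent_change(lower, upper):
    """Convert a ratio interval (RR or HR) to percent change relative to a ratio of 1."""
    return round((lower - 1) * 100), round((upper - 1) * 100)

# Interval limits as quoted in the text above.
reported_intervals = {
    "Madhi et al, overall RR": (0.98, 1.46),
    "Dugdale et al, LMIC subgroup RR": (1.22, 2.47),
    "Son et al, HR (immortal time bias addressed)": (0.64, 1.34),
}

for label, (lower, upper) in reported_intervals.items():
    low_pct, high_pct = ratio_ci_to_percent_change(lower, upper)
    print(f"{label}: {low_pct:+d}% to {high_pct:+d}%")
```

A negative value means a lower rate in the vaccinated group, and a positive value means a higher one.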
All three sets of authors’ misuse of statistical significance testing supported a preferred narrative that the vaccine confers a net benefit on infants. Other elements of spin in Dugdale et al’s editorial serve the same purpose. For instance, the first page frames vaccine hesitancy as a result of modern ignorance of the preventive health benefits of vaccines; the second notes “Reassuringly, most of the PTBs [in MATISSE] were late preterm, that is, at 34 weeks of gestation or later”; and the third and final page spends most of its text offering further possible reasons why the findings could be spurious or subgroup-specific (e.g., relating to COVID-19, differences in post-vaccine safety surveillance systems, and limited ultrasonography access in lower-income countries compromising pregnancy dating). This reads like an exercise in confirmation bias.
Meanwhile, there’s no data here on net effects to back up the implicit claim that the vaccine confers net benefit rather than incurring net harm. It is odd to minimize the risks of preterm birth — the leading cause of infant and child deaths. Infant mortality and morbidity risks at 34-36 weeks continue to be non-negligible, and the latter may be associated with long-term impairments.
Dugdale et al also don’t mention that there’s an alternative to maternal RSV vaccination: infant immunization. Nor do they mention that the alternative (nirsevimab) is not really a vaccine. It’s “a monoclonal antibody recommended to children for the prevention of severe RSV disease” and “designed to bind to the fusion protein on the surface of the RSV virus.” So instead of giving the body the pathogen in a neutered form to learn from in advance of an attack, it supplies ready-made antibodies directly.
History suggests this might be a wiser strategy when it comes to protecting infants from RSV. In 1969, a study on a formalin-inactivated vaccine for the newly discovered RSV in babies under six months old found that severe disease was much more frequent in vaccinated infants, with 80% requiring hospitalization during the next winter season, and two dying — compared to 2.5% of unvaccinated infants hospitalized, and no deaths. Ironically, Dugdale et al dismissed vaccine hesitancy as a result of historical ignorance; but the best available data are as yet ambiguous on maternal RSV vaccination and preterm birth risk, while the history of RSV vaccination and infant disease/death gives cause for concern that intervention may do more harm than good.
We need to see net effect assessments of different possible policy universes before implementing mass interventions. Until we do, the precautionary principle applies. And the burden of proof is on proponents to demonstrate that new interventions don’t cause preventable harm.
Here, that means comparing the safety and efficacy of maternal RSV vaccination in pregnancy with the safety and efficacy of treating infants with nirsevimab to prevent severe RSV. Given the possible preterm birth risks associated with the former, the latter looks safer. It may also be more effective, with one trial reporting 74.5% efficacy (95% CI 49.6-87.1) and another reporting 83.2% efficacy (95% CI 67.8-92), versus MATISSE’s 70% efficacy (50.6-82.5). In other words, these interventions’ reported efficacy intervals overlap, but some data suggest higher efficacy for the potentially lower-risk option, nirsevimab; more research is needed to assess the possibility of clinically important efficacy differences.
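The overlap between these efficacy intervals can be checked with a small helper. This is a sketch only: the function name is hypothetical, and overlapping intervals are a rough descriptive check, not a formal test of whether efficacy differs.

```python
def interval_overlap(a, b):
    """Return the range shared by two (lower, upper) intervals, or None if they are disjoint."""
    lower = max(a[0], b[0])
    upper = min(a[1], b[1])
    return (lower, upper) if lower <= upper else None

# Efficacy intervals (percent) as quoted in the text above.
matisse_ci = (50.6, 82.5)
nirsevimab_cis = [(49.6, 87.1), (67.8, 92.0)]

for ci in nirsevimab_cis:
    print(f"nirsevimab {ci} vs MATISSE {matisse_ci}: shared range {interval_overlap(ci, matisse_ci)}")
```

Both nirsevimab intervals share a range with the MATISSE interval, which is why the data cannot yet settle which option is more effective.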
***
German AI-assisted mammography study echoes Scandinavian findings, highlighting overdiagnosis concerns
A recent opt-in study comparing mammography screening with or without AI assistance at 12 sites in Germany echoed recent findings from Sweden and Denmark: researchers found that AI assistance slightly improved breast cancer detection rates and lowered recall rates, but increased DCIS (ductal carcinoma in situ) diagnoses (“Nationwide real-world implementation of AI for cancer detection in population-based mammography screening,” Eisemann et al, Nature Medicine, 2025). Experts disagree about the implications of these findings.
Eisemann et al concluded:
… based on the now available evidence on breast cancer detection, recall rates, PPV of biopsy and time savings, urgent efforts should be made to integrate AI-supported mammography into screening guidelines and to promote the widespread adoption of AI in mammography screening programs.
In contrast, leading Swedish breast cancer radiology researcher and Lund University Associate Professor Kristina Lång told The Guardian the large increase in DCIS diagnoses raises concerns about overdiagnosis: “ ‘The results are encouraging, but it is essential to ensure that we implement a method capable of detecting clinically relevant cancers at an early stage, where early detection can meaningfully improve patient outcomes.’ ” The goal is saving lives (i.e., decreasing deaths) — not increasing diagnoses (if that means increasing breast cancer cases without improving clinical outcomes).
Lång’s stated research aim “is to contribute to a further reduction in the mortality of breast cancer.” According to the Harding Center for Risk Literacy, the estimated breast cancer mortality reduction from mammography is 1 per 1000 women screened, in exchange for 100 false alarms and 5 unnecessary partial or complete breast removals per 1000. All-cause mortality reduction from mammography has not been established.
As Lång and Eisemann et al alike noted, we need longer-term follow-up to better assess the effects of AI-assisted mammography screening. In particular, Eisemann et al wrote that they plan 2-3 year follow-up analyses on “higher detection of DCIS and grade 1 cancers” to evaluate how they affect “interval cancer rate and stage distribution.” The key question is whether AI-assisted mammography detects DCIS and grade 1 cancers likely to progress to clinically meaningful cases, or those unlikely to do so.
Eisemann et al also wrote they plan to investigate:
whether examinations for which the safety net was triggered but rejected by the radiologists represent a correct decision by the reader and thus a critical safety measure to reduce recall and overdiagnosis. Possibly, these cases were missed opportunities to detect even more cancers early and to improve overall program performance further. These questions will be investigated in the 2- to 3-year follow-up analyses.
Lång highlighted the bigger-picture point that “ ‘Long-term follow-up is essential to fully understand the clinical implications of integrating AI into mammography screening.’ ”
***
DCIS trial raises risk, reporting, and classification questions
A recent Preliminary Communication on a prespecified primary analysis of two-year outcomes in a randomized trial conducted from 2017-2023 on 957 U.S. women aged 40+ with low-risk (hormone receptor–positive grade 1 or grade 2) DCIS without invasive cancer found active monitoring non-inferior to the standard surgery with or without radiation therapy (“Active Monitoring With or Without Endocrine Therapy for Low-Risk Ductal Carcinoma In Situ: The COMET Randomized Clinical Trial,” Hwang et al, Dec. 12, 2024, JAMA Network). Active monitoring entailed “follow-up every 6 months with breast imaging and physical examination.”
It appears the authors have not yet published a fuller analysis with complete compatibility intervals. Instead they reported results based on thresholding against a noninferiority bound of 5% (0.05 as a proportion). This should ring a bell; it looks a lot like p-value thresholding, the discredited but widespread practice of misinterpreting statistical significance testing results that often leads to dismissal of potentially important effects and hyping of potentially spurious ones.
Hwang et al’s preliminary report said:
the 2-year Kaplan-Meier cumulative rate of ipsilateral invasive cancer was 5.9% [27 women] in the guideline-concordant care group vs 4.2% [19 women] in the active monitoring group, a difference of −1.7% (upper limit of the 95% CI, 0.95%), indicating that active monitoring is not inferior to guideline-concordant care.
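The noninferiority reading of these numbers can be sketched in a few lines of Python. Two things here are assumptions and are not stated in the quoted passage: the margin of 5 percentage points, and the reconstruction of a lower limit by mirroring the upper limit around the point estimate (the report gives only the upper limit, and the true interval may not be symmetric). Function names are illustrative.

```python
def is_noninferior(upper_limit, margin):
    """Noninferiority is declared when the interval's upper limit falls below the margin."""
    return upper_limit < margin

def mirrored_lower_limit(estimate, upper_limit):
    """Mirror the upper limit around the point estimate (ASSUMES a symmetric interval)."""
    return estimate - (upper_limit - estimate)

difference = -1.7      # active monitoring minus guideline-concordant care, percentage points
upper_limit = 0.95     # reported upper limit of the 95% CI, percentage points
ASSUMED_MARGIN = 5.0   # assumption for illustration, in percentage points

print("noninferior under assumed margin:", is_noninferior(upper_limit, ASSUMED_MARGIN))
print("mirrored lower limit:", round(mirrored_lower_limit(difference, upper_limit), 2))
```

Under those assumptions the interval would run from about -4.35 to 0.95 percentage points. That range is compatible with a lower two-year invasive cancer rate under active monitoring as well as with essentially no difference.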
These results actually suggest it is possible that more radical intervention causes an increased risk of invasive cancer on the same side. That would be a very important finding. Possible causal mechanisms of such a perverse effect could include surgery spreading breast cancer cells to neighboring lymphatic tissue.
On one hand, we do see a small breast cancer mortality reduction from mammogram screenings. On the other hand, what if that small decrease is being counter-balanced by a smaller increase in other breast cancer mortalities due to this unintentional cell spreading? We can’t tell from available data if that is the case.
It’s not clear we would be able to tell much about this possible iatrogenic effect from this study, since the authors also wrote “346 patients had surgery for DCIS, 264 in the guideline-concordant care group and 82 in the active monitoring group.” In other words, you can’t ethically randomize women with DCIS to get surgery or not; it’s their choice. So the groups probably weren’t actually randomized to receive the treatment of interest (surgery with or without radiation), or not. Perhaps women were randomized to get different information when making that choice. But the preliminary report doesn’t say, so we don’t know. This discrepancy suggests this was not a randomized trial on what the preliminary report text implies it was, i.e., whether more radical (surgery or surgery + radiation) intervention for low-risk DCIS demonstrably decreases invasive cancer rates at two years compared to active monitoring, or not.
Nor do we get information in this report pertaining to results on what the preliminary report title said it was about, i.e., “active monitoring with or without endocrine therapy for low-risk ductal carcinoma in situ.” Maybe that title reflects what was actually randomized here; maybe not. Either way, the discrepancy between the treatment in the report title (endocrine therapy) and the treatment in the report text (surgery+) is disconcerting. The title implies we should learn something about endocrine therapy treatment and outcome differences here, but we do not. Does a clinical trial preregistration resolve this discrepancy?
This one only generates more questions. For instance, it specified the sample to include HER2 equivocal or HER2 negative and HR positive DCIS diagnoses. But the preliminary report dealt only with HR positive cases. Why report results only for a subgroup analysis? Or were the sample selection criteria changed after preregistration? Also, the preregistration keeps saying “randomized,” but it’s not clear what was randomized.
All that said, the preregistration’s detailed description makes an excellent argument for the importance of this vein of research:
Overdiagnosis and overtreatment resulting from mammographic screening have been estimated to be as high as 1 in 4 patients diagnosed with breast cancer although the absence of standard definitions for measuring overdiagnosis has led to much uncertainty around this estimate… There is general consensus that much of this burden derives from the treatment of DCIS… In those women who undergo surgical management of DCIS, there is risk of developing persistent pain at the surgical site, with estimates ranging from 25-68%… leading to disability and psychological distress… Although prospective population-based data have demonstrated… remarkably high levels of chronic pain 4 and 9 months after breast surgery, much of these data have been collected in women with invasive cancer, with little data directly relevant to patients with DCIS.
In this context, the COMET trial fits into a wider debate about whether to classify DCIS as cancer and DCIS diagnoses as false positives in mammography screening. Susan Bewley, a renowned British evidence-based medicine reformer, King’s College London Emeritus Professor of Obstetrics and Women’s Health, and leading critic of mass mammography screening for early breast cancer detection, has tweeted: “… DCIS is not cancer, and it does not inevitably develop into invasive cancer… ” Covering the COMET trial preliminary report, investigative reporter Maryanne Demasi recently wrote “DCIS is not actually cancer but rather a cluster of abnormal cells contained within the milk ducts.”
The question “is DCIS cancer?” implies a binary answer. But some experts say it depends (e.g., on the grade). The same goes for whether DCIS diagnoses are false positives. These contested classifications reflect complexity and ambiguity in the underlying evidence as well as different assessments of what the goal of mammography is, and how much risk of what sort of harm to tolerate for how much benefit of another sort.
If you think the goal is correct test classification, DCIS is not a false positive per se — it’s a carcinoma classification, at least until there’s a consensus to change the name. Some low-grade DCIS cases do progress to invasive cancers; most don’t.
Conversely, if you think the overarching goal is improving the outcome of interest in the bigger picture — by using early diagnosis to improve the prognosis of clinically relevant cancer cases and/or decrease breast cancer, all-cancer, and/or all-cause deaths — then low-risk DCIS diagnoses should probably be called false positives. Recent research claiming to show AI-assisted mammography improves cancer detection while lowering false positives (e.g., Hwang et al 2024 and Lauritzen et al 2024) spins this ambiguity away into hype.
This is about more than mammography. The same questions — what is a false positive? is the goal (contestable) correct test classification, or improving the big-picture outcome of interest? — turn up again and again across mass screenings/interventions for low-prevalence problems. Here’s another instance…
***
Researchers tout Group B Strep PCR testing accuracy; but would universal pregnancy screening do net benefit or harm?
A recent meta-analysis found the pooled AUROC (area under the receiver operating characteristic curve) for real-time polymerase chain reaction testing for Group B Strep in pregnant women was .99 (95% CI .98-1.00) (“Accuracy of real-time polymerase chain reaction test for Group B Streptococcus detection in pregnant women: A systematic review and meta-analysis,” Peng et al, Eur J Obstet Gynecol Reprod Biol, Jan. 2025, 304:141-151).
These findings highlight PCR testing as a highly accurate diagnostic tool, offering faster and more reliable detection of GBS during pregnancy compared with traditional culture testing. Correctly classifying all the cases sounds great; but is this really what we want?
The American College of Obstetricians and Gynecologists (ACOG) FAQ on Group B Strep and pregnancy notes: “About 1 in 4 pregnant women carry GBS. Although GBS is fairly common in pregnancy, very few babies get sick with GBS disease. The risk of GBS disease is higher in babies who are born before 37 weeks of pregnancy.”
The clinical importance of a positive GBS test in pregnancy is a matter of judgment. The absolute risk of related neonatal problems is quite low, while the prevalence of maternal GBS carriage is quite high. So if we define false positives not as cases wrongly diagnosed with GBS, but as cases wrongly treated with follow-up interventions to prevent (non-existent) neonatal problems, then the volume of false positives from universal GBS screening in pregnancy remains quite high even when the test’s AUROC is .99.
Call it the false positive problem, call it overdiagnosis. It doesn’t really matter what we call this carnage. It matters that someone does a net cost-benefit assessment before implementing a mass screening (like GBS testing in all pregnant women) for a low-prevalence problem (like neonatal GBS disease). But we live in a society without guardrails for programs that share this dangerous structure.
The standard GBS intervention is IV antibiotics during labor, which carries its own costs and possible risks to infants and mothers alike. For instance, decimating the microbiome in a sensitive developmental period might not be a great idea; these IV antibiotics could adversely impact infant brains and bellies, dooming young families to months of colic, or even years of developmental differences. Similarly, tethering laboring mothers to IVs impairs their mobility in addition to causing pain and stress, all of which could adversely impact birth experiences and outcomes — with potentially long-term physical and psychological consequences.
Previous research suggests that the frequency of the benefit is quite small relative to the number needed to treat: National U.S. guidelines recommending screening and treatment for affected women were issued in 1996 and revised in 2002. Subsequent review of 2003-2004 multistate population-based labor and delivery records found 85% screening uptake and reduced incidence of invasive early-onset neonatal GBS disease from 1.8 cases per 1000 births in the early 90s, to .26 cases per 1000 births in 2010. That’s a difference of 1.54 cases per 1000 births, or .154%.
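The arithmetic above can be carried one step further to rough "number needed" figures. This is a back-of-envelope sketch under loud assumptions: that the whole decline in incidence is attributable to screening and intrapartum antibiotics, that every carrier is treated, that all early-onset cases arise among carriers, and that the ACOG figure of about 1 in 4 carriers applies. Variable names are illustrative.

```python
baseline_cases_per_1000 = 1.8   # early 1990s incidence, per the text above
later_cases_per_1000 = 0.26     # 2010 incidence, per the text above
carriers_per_1000 = 250         # ASSUMPTION: about 1 in 4 pregnant women carry GBS (ACOG FAQ)

# Absolute reduction in cases per 1000 births.
reduction_per_1000 = baseline_cases_per_1000 - later_cases_per_1000

# Births screened, and carriers treated, per early-onset case prevented.
births_per_case_prevented = 1000 / reduction_per_1000
carriers_per_case_prevented = carriers_per_1000 / reduction_per_1000

# Share of carriers whose babies would not have developed early-onset disease
# even at the pre-screening incidence (ASSUMES all cases arise among carriers).
share_treated_without_benefit = 1 - baseline_cases_per_1000 / carriers_per_1000

print(f"reduction: {reduction_per_1000:.2f} per 1000 births")
print(f"births screened per case prevented: about {births_per_case_prevented:.0f}")
print(f"carriers treated per case prevented: about {carriers_per_case_prevented:.0f}")
print(f"carriers treated without possible benefit: {share_treated_without_benefit:.1%}")
```

On these assumptions, roughly 650 births are screened and roughly 160 carriers treated for each case prevented, and over 99% of treated carriers could not have benefited.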
The true event of interest here is not GBS, but rather, clinically relevant neonatal GBS disease — a rare event. This rarity, combined with the persistent uncertainty from the screening about which cases of maternal GBS would go on to produce clinically relevant neonatal GBS disease, a lack of secondary screenings to assess that, and the presence of possible harms from resulting intervention, casts doubt on the presumptive net benefit of universal GBS screenings.
***
Thresholding and unclear causal logic in recent research on high blood sugar and pregnancy risks
Recent research finds type 2 diabetes and elevated periconceptional HbA1c both increase pregnancy risks, suggesting that maybe blood sugar control should be assessed on a continuum instead of as a binary variable — with the clinical goal being patient education to improve health, instead of dichotomizing to diagnose disease.
Clement et al’s Dec. 2024 Am J Obstet Gynecol “Pregnancy Outcomes in Type 2 Diabetes: a systematic review and meta-analysis” built on well-established findings that T2D increases adverse pregnancy outcome risks. Comparing T2D pregnancies with type 1 diabetes, gestational diabetes, and non-diabetic pregnancies, the authors found increased risks of neonatal and perinatal mortality; congenital anomalies, neonatal mortality, and stillbirth; and congenital anomalies, perinatal mortality, and stillbirth, respectively. One possible causal pathway is relatively impaired periconceptional glucose control, even in the absence of frank hyperglycemia. Another is higher BMI, often associated with T2D, as an independent risk factor for pregnancy complications.
In the same publication crop, Rotem et al’s Dec. 2024 Hum Reprod found maternal periconceptional “(HbA1c) levels over 5.6% were associated with an increased risk of congenital heart defects (CHD) in the offspring, and maternal preconception diabetes was associated with an increased risk of CHD, including when HbA1c levels were within euglycemic ranges” (“Maternal periconception hyperglycemia, preconception diabetes, and risk of major congenital malformations in offspring”). HbA1c is widely recognized as “an important indicator of long-term glycemic control with the ability to reflect the cumulative glycemic history of the preceding two to three months… a reliable measure of chronic hyperglycemia.” Others consider it a proxy for metabolic health more broadly. These are not necessarily different positions: HbA1c could both reflect the preceding months’ glycemic control on a continuum, and predict broader metabolic health, including metabolic syndrome.
Rotem et al argued their findings “[suggest] underlying causal pathways that are partly independent of maternal glucose control. Therefore, treatments for hyperglycemia might not completely mitigate the teratogenic risk associated with maternal preconception diabetes.”
The first part of this inference is invalid. Rotem et al’s evidence suggests that poorer blood sugar control predicted congenital anomalies even in the absence of hyperglycemia. But, rather than implying that another causal pathway is necessary to explain the findings, this could mean that thresholding imposes a binary classification on a non-binary reality: glucose control exists on a continuum, and impaired glucose control short of diagnosable disease (as manifest in elevated HbA1c) increases pregnancy risks.
In this sense, both studies suggest that thresholding — for hyperglycemia in T2D management (Clement et al) or HbA1c in periconceptional health (Rotem et al) — may reflect the common cognitive bias of dichotomania rather than a logical way to identify clinically important risks. Researchers should be wary of throwing away valuable information by dichotomizing continuous variables, and clinicians should help diabetic patients mitigate risks by improving metabolic health through safe and effective interventions. What interventions fit that bill?
What lowers blood sugar also lowers HbA1c: oral antidiabetes agents, inositol, exercise, and semaglutide all appear to lower it in the same effect size ballpark (~.5-1.25%). So the blood sugar control intervention with the best overall safety and efficacy profile probably remains diet and exercise, if you can do it. Along with myo-inositol and vitamin D, it’s also the best evidence-based prevention for gestational diabetes, and we know those options are generally safe before and during pregnancy.
But patients with poor blood sugar control often need more help managing it than education about lifestyle changes can provide. And maybe Rotem et al are right that another causal pathway matters here, too, perhaps one having to do with BMI. This suggests a possible case for randomized trials on periconceptional semaglutide (Ozempic) in T2D patients to improve pregnancy outcomes. There’s a lot of accidental exposure anyway, so more systematic safety and efficacy data would seem to serve the public interest.
***
Statistical significance testing misuse hides possible contraceptive implant harms
Authors of a recent study on postpartum contraceptive implants and breastfeeding misinterpreted their results, wrongly claiming they show no effect when they show a substantial possible effect (Levi et al, “Immediate postpartum contraceptive implant placement and breastfeeding success in postpartum people at risk for low milk supply: a randomized non-inferiority trial,” Contraception, Dec. 2024).
The study randomized to three groups 155 women at risk for low milk supply who planned to breastfeed: contraceptive implant placement “within 30 minutes of placental delivery, 24-72 hours postpartum, or 6+ weeks postpartum.” It found substantial possible effects on how long it took for their milk to come in (lactogenesis II, LII):
Compared to those who received implants 6+ weeks postpartum, those who received it ≤30 minutes postpartum (mean difference: 2.92 hours, 95%CI: -9.26, 15.1, p=0.64) or 1-3 days postpartum (mean difference: -0.75 hours, 95%CI: -13.02, 11.51, p=0.90) did not have statistically significant different time to LII.
Interpreting these results according to statistical significance test thresholding, the authors concluded: “Ultimately, our study suggests that early etonogestrel implant insertion does not affect breastfeeding success among postpartum people at risk for low milk supply.”
However, full interpretation of the reported compatibility intervals actually suggests that those who had their implants installed ≤30 minutes postpartum may have had LII up to 15+ hours later than those who waited 6+ weeks, and those who had theirs put in 1-3 days postpartum may have had LII up to almost 12 hours later than those who waited 6+ weeks. Alternately, it’s also possible that earlier placement decreased time to LII by up to 9+ hours in the ≤30 minute group as compared to the 6+ weeks group, or up to 13+ hours in the 1-3 day group as compared to the 6+ weeks group. These wide intervals are compatible with longer or shorter times to mature milk production, or no effect.
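As a consistency check, the reported p-values can be approximately recovered from the reported estimates and intervals. The sketch below assumes a normal approximation with a critical value of 1.96, which may differ slightly from the method the authors used. The function name is hypothetical.

```python
import math

def p_value_from_ci(estimate, lower, upper, z_crit=1.96):
    """Approximate two-sided p-value from a point estimate and its 95% interval.

    ASSUMES a symmetric, normal-theory interval; the standard error is
    recovered as the interval width divided by 2 * z_crit.
    """
    standard_error = (upper - lower) / (2 * z_crit)
    z = estimate / standard_error
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))

# Mean differences in hours to LII versus the 6+ weeks group, as quoted above.
p_30_minutes = p_value_from_ci(2.92, -9.26, 15.1)
p_1_to_3_days = p_value_from_ci(-0.75, -13.02, 11.51)

print(f"<=30 minutes: p is about {p_30_minutes:.2f}")
print(f"1-3 days: p is about {p_1_to_3_days:.2f}")
```

Both values land close to the reported p=0.64 and p=0.90, so the intervals and p-values tell the same story: the data are too imprecise to rule out delays of many hours in either direction.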
What’s clinically important for women who want to breastfeed and opt for postpartum contraceptive implants is that we don’t know if earlier implant installation delays or impairs lactation. There are, however, good reasons to suspect that hormonal birth control may cause reproductive hormonal changes that adversely impact lactation.
Specifically, the implant in this study contained the progestin etonogestrel. Progesterone “interferes with prolactin action at the alveolar cell’s prolactin receptor level” by “(1) inhibiting up regulation of the prolactin receptor, (2) reducing estrogen binding (lactogenic activity), and (3) competing for binding at the glucocorticoid receptor.” So we don’t know if getting etonogestrel implants sooner postpartum delayed or otherwise interfered with lactation in this study, but we do know that there’s a plausible causal explanation for that possible effect.
Clinically, it could matter if infants have to wait an extra 12-15 hours for their mother’s mature milk to come in. Such a delay could contribute to common and preventable harm to neonates associated with breastfeeding insufficiencies including delayed onset of LII. Medical complications from accidental starvation in this context include jaundice, which may substantially increase permanent neurodevelopmental risks.