Adjusting the Santa Clara County COVID-19 Antibody testing study results for self-selection bias


A drive-through test site in NY. Wikimedia commons license.

The infection fatality rate (IFR) of COVID-19 is one of the most important parameters for mathematical models of the pandemic, yet it remains largely a mystery because we don’t yet know how many people have actually been infected. The IFR is the number of people who die from COVID-19 divided by the number of people who get infected. The case counts you see in the news each day include only people who have been diagnosed with COVID-19 or who have tested positive; they do not include people who never get tested or are asymptomatic. Studies that assess the total number of infections are therefore critically important. This morning, the results of a large-scale Stanford University study that tested for the presence of antibodies in the population of Santa Clara County, California were released.

  • Eran Bendavid, Bianca Mulaney, Neeraj Sood, Soleil Shah, Emilia Ling, Rebecca Bromley-Dulfano, Cara Lai, Zoe Weissberg, Rodrigo Saavedra, James Tedrow, Dona Tversky, Andrew Bogan, Thomas Kupiec, Daniel Eichner, Ribhav Gupta, John Ioannidis, Jay Bhattacharya (17-Apr-2020), “COVID-19 Antibody Seroprevalence in Santa Clara County, California”.

When conducting a large-scale community antibody test like this, the most important aspect of the experimental design is to make sure that participants don’t self-select into the study. People who have reason to believe that they may have had COVID-19, or may have been exposed, will be vastly more eager to get tested than people who have no reason to think they may have had it. Ideally, you would either test everyone or select participants at random, rather than simply asking people to volunteer. If people self-select, this self-selection bias will result in a population of test subjects who are far more likely to have had the disease than the general population.

Today’s study is a very important one, and the researchers did a lot of other things right, but unfortunately they let participants self-select, so I believe that their reported results have overestimated the true prevalence in Santa Clara County. In this posting, I examine how to adjust their results to account for the self-selection bias.

How participants were selected

The following description of how participants were recruited is taken directly from their paper:

We recruited participants by placing targeted advertisements on Facebook aimed at residents of Santa Clara County. We used Facebook to quickly reach a large number of county residents and because it allows for granular targeting by zip code and sociodemographic characteristics. We used a combination of two targeting strategies: ads aimed at a representative population of the county by zip code, and specially targeted ads to balance our sample for under-represented zip codes. In addition, we capped registration  from overrepresented areas. Individuals who clicked on the advertisement were directed to a survey hosted by the Stanford REDcap platform, which provided information about the study.

Basically, they put out a targeted ad on Facebook. Although the paper doesn’t say so here, I have heard secondhand from people here in Santa Clara County that the ad offered a $10 Amazon gift card to participate. They targeted the ad to a subset of the population in each zip code.

They made substantial efforts to obtain a representative sample from the different zip codes and demographic groups within the county, and when the proportions of participants still didn’t quite match these distributions, they adjusted their results to compensate. Their attention to this aspect of sampling bias was good. But they made no adjustments for self-selection bias.

Blood was drawn from all subjects for testing on 3-Apr-2020 or 4-Apr-2020 at three drive-through test sites in Los Gatos, San Jose and Mountain View.

The study’s results

After eliminating some subjects for various technical reasons, the study ended up with 3,330 people with test results. Of those, 50 tested positive, or 1.50% of the tests. After adjusting for demographic and geographic sampling biases, they revised this to an estimate of 2.81% positive. They then adjusted for the accuracy of the test (the actual accuracy is uncertain) to arrive at a final prevalence estimate between 2.49% and 4.16%, where the range reflects different assumptions about the test’s accuracy. This prevalence translates to between 48,000 and 81,000 people in the county, which is 50-85 times the number of confirmed cases. This led to an IFR of 0.12% to 0.2%.
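These headline figures can be cross-checked against one another with simple arithmetic. The sketch below back-calculates the number of confirmed cases and the number of deaths that the quoted multipliers and IFR range imply. The implied values are derived here purely for illustration and are not quoted from the paper; the variable names are my own.

```python
# Figures quoted above from the paper.
infected_low, infected_high = 48_000, 81_000  # estimated infections in the county
fold_low, fold_high = 50, 85                  # multiples of the confirmed case count
ifr_low, ifr_high = 0.0012, 0.0020            # IFR of 0.12% and 0.2%

# Confirmed cases implied by "50-85 times the number of confirmed cases".
implied_cases_a = infected_low / fold_low
implied_cases_b = infected_high / fold_high

# Deaths implied by IFR = deaths / infections.
# The lower IFR pairs with the higher infection estimate, and vice versa.
implied_deaths_a = ifr_low * infected_high
implied_deaths_b = ifr_high * infected_low

print(f"implied confirmed cases: {implied_cases_a:.0f} and {implied_cases_b:.0f}")
print(f"implied deaths: {implied_deaths_a:.0f} and {implied_deaths_b:.0f}")
```

Both ends of each range imply roughly the same case count and roughly the same death count, so the quoted numbers are at least internally consistent.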

The results I just listed are directly from the paper. None of these results are adjusted for self-selection bias.

Self-selection adjustment

Suppose that people who have previously had COVID-19 are 10 times more likely to sign up for a test than people who have never been infected. This factor is called the likelihood ratio, which I’ll denote as L. In this example, L=10.
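To get a feel for how strongly a likelihood ratio of this size can distort a sample, here is a minimal simulation sketch. The true prevalence (1%), the enrollment probabilities (10% versus 1%, chosen only so that L=10), the population size and the random seed are all made-up values for illustration; none of them come from the study.

```python
import random

def simulate_observed_prevalence(true_prevalence, p_enroll_infected,
                                 p_enroll_uninfected, population, seed=0):
    """Return the fraction of enrolled people who were previously infected."""
    rng = random.Random(seed)
    enrolled = 0
    enrolled_infected = 0
    for _ in range(population):
        infected = rng.random() < true_prevalence
        # Previously infected people enroll with a higher probability.
        p_enroll = p_enroll_infected if infected else p_enroll_uninfected
        if rng.random() < p_enroll:
            enrolled += 1
            enrolled_infected += infected
    return enrolled_infected / enrolled

observed = simulate_observed_prevalence(
    true_prevalence=0.01,      # hypothetical true prevalence
    p_enroll_infected=0.10,    # hypothetical; 10x the uninfected rate, so L=10
    p_enroll_uninfected=0.01,  # hypothetical
    population=500_000,
)
print(f"true prevalence 1.00%, prevalence among enrollees {observed:.2%}")
```

With these inputs the expected prevalence among enrollees is 0.001 / (0.001 + 0.0099), or about 9.2%, roughly nine times the true value.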

Let C denote whether a person has had COVID: C=true if they have, C=false if they haven’t. Let T denote whether that person enrolls in the test: T=true means they enroll, T=false means they don’t. With this notation, Bayes’ rule can be written as

    \[ odds( C | T ) = odds( C ) \cdot L \]

Here odds(C|T) is the odds that a person had COVID-19 given that they take the test, and odds(C) is the odds that a person from the general population had COVID-19. The study determined odds(C | T), whereas what we care about is odds(C), which doesn’t have the self-selection sampling bias.
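For readers more used to the probability form of Bayes’ rule, the odds form follows in one step. Taking the likelihood ratio to be L = P(T|C) / P(T|\neg C), which is how I read the definition above, apply Bayes’ rule separately to C and to \neg C and divide, so that P(T) cancels:

    \[ odds( C | T ) = { P(C|T) \over P(\neg C|T) } = { P(T|C) \, P(C) / P(T) \over P(T|\neg C) \, P(\neg C) / P(T) } = { P(T|C) \over P(T|\neg C) } \cdot { P(C) \over P(\neg C) } = L \cdot odds( C ) \]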

The study estimated the prevalence, p, which is related to odds as odds(C|T) = p / (1-p). With some simple algebra, the true prevalence is obtained by multiplying the study’s estimate by an adjustment, \alpha, given by

    \[ \alpha = {1 \over { L (1-p) + p }} \]
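The algebra can be spelled out as follows, writing q for the true prevalence (a symbol introduced here only for this derivation). Dividing both sides of Bayes’ rule by L and substituting odds(C|T) = p / (1-p) gives

    \[ odds( C ) = { p \over L (1-p) } \]

and converting these odds back into a probability gives

    \[ q = { odds( C ) \over 1 + odds( C ) } = { p \over L (1-p) + p } = \alpha \cdot p \]

When p is small, as it is here, \alpha is close to 1/L.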

This adjustment is shown in the following graph as a function of the likelihood ratio.

To adjust for self-selection bias, multiply the study results for prevalence or number of infections by this self-selection adjustment factor. For example, if you think a person who was previously infected is 10 times more likely to participate in the study, use \alpha \approx 0.1.

To adjust for self-selection bias, you need to estimate how many times more likely a person who previously had COVID-19 would be to participate in the study, compared to a person who has never had it. This is the likelihood ratio. If you think that person would be 2.5 times as likely, read the self-selection adjustment for L=2.5 from the above graph, which is \alpha \approx 0.4, and then multiply the study’s prevalence numbers by it, which yields a prevalence between 1.00% and 1.69%. You can also multiply by \alpha to adjust the total number of infected people, which for L=2.5 gives between 19,500 and 33,000. To adjust the IFR estimate, divide by \alpha, which yields an adjusted IFR estimate between 0.29% and 0.49%. Adjusted values for these results are shown in the next four graphs as a function of likelihood ratio, with the two lines showing the lower and upper estimates.
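The L=2.5 example can be reproduced in a few lines. This is a sketch with my own function and variable names; it evaluates \alpha from the formula separately for each end of the study’s range instead of reading one rounded value off the graph, so the outputs differ slightly from the rounded figures in the text.

```python
def self_selection_adjustment(p, L):
    """Adjustment factor alpha = 1 / (L*(1 - p) + p) for a study prevalence p."""
    return 1.0 / (L * (1.0 - p) + p)

L = 2.5
p_low, p_high = 0.0249, 0.0416  # study prevalence range

alpha_low = self_selection_adjustment(p_low, L)
alpha_high = self_selection_adjustment(p_high, L)

# Prevalence and infection counts are multiplied by alpha.
adj_prev_low = alpha_low * p_low
adj_prev_high = alpha_high * p_high
adj_infected_low = alpha_low * 48_000
adj_infected_high = alpha_high * 81_000

# The IFR is divided by alpha. The study's lower IFR (0.12%) goes with its
# higher prevalence estimate, and its higher IFR (0.2%) with the lower one.
adj_ifr_low = 0.0012 / alpha_high
adj_ifr_high = 0.0020 / alpha_low

print(f"alpha: {alpha_low:.3f} to {alpha_high:.3f}")
print(f"prevalence: {adj_prev_low:.2%} to {adj_prev_high:.2%}")
print(f"infected: {adj_infected_low:,.0f} to {adj_infected_high:,.0f}")
print(f"IFR: {adj_ifr_low:.2%} to {adj_ifr_high:.2%}")
```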

Prevalence estimate in Santa Clara County after adjusting for self-selection bias.
Estimated number of previously infected people in Santa Clara County after adjusting for self-selection bias.
Estimated Infection Fatality Ratio after adjusting the study’s results for self-selection bias.
Even after correcting for self-selection bias, there appear to be many more cases than reported.
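The same curves can be tabulated numerically. The sketch below evaluates the adjustment over a handful of likelihood ratios that I picked for illustration, starting from the study’s reported ranges; the helper name `adjust` and the tuple layout are my own.

```python
def adjust(p, infected, ifr, L):
    """Return (prevalence, infections, IFR) adjusted for likelihood ratio L."""
    alpha = 1.0 / (L * (1.0 - p) + p)
    return alpha * p, alpha * infected, ifr / alpha

# Study bounds as (prevalence, infections, IFR).
# The lower prevalence bound pairs with the higher IFR, and vice versa.
LOW = (0.0249, 48_000, 0.0020)
HIGH = (0.0416, 81_000, 0.0012)

rows = []
for L in (1, 2, 2.5, 5, 10):
    lo = adjust(*LOW, L)
    hi = adjust(*HIGH, L)
    rows.append((L, lo, hi))
    print(f"L={L:>4}: prevalence {lo[0]:.2%} to {hi[0]:.2%}, "
          f"infected {lo[1]:,.0f} to {hi[1]:,.0f}, "
          f"IFR {hi[2]:.2%} to {lo[2]:.2%}")
```

Setting L=1 returns the study’s unadjusted figures, which makes a convenient sanity check.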

Estimating the Likelihood ratio

To adjust the study’s results, you have to estimate the likelihood ratio — how many times more likely someone previously infected would be to participate compared to someone never infected. Of course, this is also a big unknown.

Because the person considering whether to register doesn’t know whether she has had COVID-19 or not, you might find the likelihood ratio to be overly abstract. It is easier to compare how much more likely someone who had symptoms at some point would be to participate in the study than someone who never had any symptoms. Of course, you might also want to consider the possibility that a person may be more likely to register because they know they came into contact with an infected person, or because their job makes them more vulnerable, etc. Because there are these other possibilities, I chose not to decompose the likelihood ratio into separate estimates.

I live in Santa Clara County, and I was aware that the study was taking place. Many of my family’s friends were also aware that it was taking place, and we know of at least one person who tried very hard (unsuccessfully) to find the ad so that she could participate, because as she said, she had been sick and wanted to know if she had had COVID-19. Hence, based on my own experiences, my personal guess is L=5. But there is nothing magic about my guess; yours may differ.

Summary

The results from today’s study may lead some people to conclude that COVID-19 is no worse than the common flu. But this adjustment for self-selection bias shows that this is not true. As the graphs above show, the prevalence drops off quickly when adjusted for even a small self-selection bias.

The number of people in the population who have had COVID-19 is an extremely important parameter for mathematical models of the pandemic. It is a critical piece of knowledge for determining how deadly a COVID-19 infection really is, for projecting hospital capacity and how many people may die, and for deciding when the economy can be reopened. Large-scale tests of the population, like the Stanford study published today, help us estimate the prevalence of COVID-19 in the population.

When conducting a large-scale prevalence study, it is important to eliminate self-selection bias. When people are allowed to decide whether to participate in the study, those who have reason to believe that they are more likely to have had it will be more likely to participate. This may include people who had suspicious symptoms at some point, who know they were exposed at some point, or who work in situations that put them at greater risk of infection. In this article, I discussed how we can attempt to adjust for self-selection bias in the large-scale Santa Clara County study that was published this morning.

 


30 Comments
Paul
7 months ago

good point! have you had any reaction from the study authors?

Lonnie Chrisman
7 months ago
Reply to  Paul

Hi Paul. As far as I know, not yet.

Rodger Bodoia, MD, PhD
7 months ago

I agree. Their methodology was deeply and unnecessarily flawed. You raised good motivators for self-selection: having felt ill, having been around someone who felt ill, being in a higher risk profession. Also, note the bias that is inherent in the method of using Facebook as the messenger with a brief period between posting on FB and the actual testing. We would need significant information on the other behaviors of people who use FB this frequently and whether they are more or less likely to have engaged in practices that would have put them at risk of acquiring the virus. Back…

Andrew W
7 months ago

The obstetric patient study is really interesting. I think the more revealing statistic there was the identification of 22 completely asymptomatic women, with only 7 showing symptoms (4 plus 3 who subsequently developed fever, though some may have been postpartum endometriosis). This suggests that only 25% of infected individuals may exhibit symptoms. Of course, the immune systems of pregnant women at term are “interesting” (e.g. activated CD69+ NK cells are elevated, for example), and they tend to be young and fit (I wouldn’t consider pregnancy to be a co-morbidity). I would therefore be surprised if that 3:1 ratio of asymptomatic…

Grace Ann Pixton
7 months ago

Dr. Chrisman, Thank you for working through a potential correction to the selection bias present in Dr. Bhattacharya’s study. This model illustrates the impact that volunteer selection bias could have on the study result. As you suggest, your assumption of L=5-10 is fairly arbitrary and does not necessarily aid in demonstrating a more accurate prevalence estimate. The survey data from Dr. Bhattacharya’s study that reported previous symptoms of volunteers could be compared to Google’s symptom tracker for the county. This would provide a fairly accurate way to adjust for volunteer bias. I would be interested in seeing the effect that…

Yigal B.
7 months ago

Lonnie, I believe you raised an excellent point regarding the bias.
However something bothers me in the first equation: L is defined as P(T|C)/P(T|C=false)
However Bayes:
P(C|T) = P(C)* P(T|C)/P(T)
So some redefinition of L may be needed here.

Andrew W
7 months ago

Hi Lonnie, Spot on analysis, and looks like you might have extracted a more plausible number from the Bendavid et al data. I would also note that the serological test measures anyone who has ever been infected (i.e. in the last 120 days or so; though it can plausibly inform about recent infections [IgM] versus historic infections [IgG]), whereas the PCR test measures a snapshot of those that are currently infected. One would need to do some more fancy math(s) here, but this would also factor in to a better assessment of the real mortality rate. Remember that at least…

Donald Cole
7 months ago

great work…much appreciated

Shankar Kurra
7 months ago

Chris excellent analysis! Thanks

William A. Roper, Jr.
7 months ago

Thank you very much for the thoughtful analysis regarding self-selection bias! I would only add that the tendency to self-select seems to me very likely to relate not only to (a) those who experienced COVID-19 like symptoms (perhaps those who had a bout with un-diagnosed influenza or something else) and (b) a known contact with a person known to be infected, but also those who (c) have frequented those places associated with known cases and/or (d) know someone who was infected even if they had no direct contact with such person since the outbreak began. These latter might be more…

john ryan
7 months ago

Great analysis Lonnie. I too had the sense that doubling or tripling prevalence estimates based upon demographic data when such a small number tested positive is likely to significantly compound any errors generated by self-selection as opposed to randomized selection of participants. It was pointed out to me that because asymptomatic cases can be a significant portion of positive cases, this fact may ameliorate some of the self selection bias. I do not know how to integrate that factor into the adjustments the study team made. Another important point, accuracy of the anti-body tests has been called into question and…

john ryan
7 months ago

The fact that study leaders Drs. Bendavid & Bhattacharya did not reveal their previous analysis of the Vo, Italy widespread test results stating infection prevalence was so high in Italy (3%) that CFR was only .06% in the discussion section of the SCC study indicates they did not want this fact revealed prior to publication. Drs. Bendavid & Bhattacharya also posited in the March 24 WSJ Op-Ed piece titled “Is Covid-19 as Deadly as They Say? that the total mortality of CV-19 in this country would be between 20K &40k, the high end be eclipsed in this country today. Study…

Jesse
7 months ago

A wise self reflection. Calling out technical challenges in producing high integrity sampling is fair. Imputing perverse motivations on its authors due to a few speculative accusations is amateur. As it stands, the Santa Clara study has provided valuable insight at a time when any sense for the true progression of the virus is needed to support policymaking.

Tom
7 months ago

While I fully understand why you corrected yourself and applaud you for that, I unfortunately had the opposite conclusion. I initially thought the researchers were well-intentioned but poorly trained scientists who were a bit too eager to publish their results. But the same group did a similar study in LA county and went on a press tour. And their preprint could only be found in redstate.com, a far right website. (you will understand what I mean by far right when you go there). There is a somewhat innocuous explanation for this. (they released the preprint to hundreds of media websites…

BrianB
7 months ago

I don’t have a PhD after my name but I do find it kind of a headscratcher that a criticism of the accuracy of a paper includes an estimate based primarily on “Many of my family’s friends were also aware that it was taking place, and we know of at least one person who tried very hard (unsuccessfully) to find the ad so that she could participate, because as she said, she had been sick and wanted to know if she had had COVID-19”. You’re right that your guess isn’t magical but it is pretty comical. Considering the universal hype…

James Mitchell
7 months ago
Reply to  BrianB

HI Brian.. From personal experience I totally agree with your assessment that self selection may have gone the other direction. In Alberta where testing is now available, people are asked to register for testing based on a self assessment that includes runny nose and coughing.. for those who are freaked out, any sniffle is Covid. This might be just my availability bias but in my view there is equal likelihood that the paranoid signed up in droves.

kpkinsunnyphiladelphia
7 months ago

Lonnie, very nice analysis. To be fair, Bendavid and his co-authors conceded that self-selection might be a problem: they wrote. “Other biases, such as bias favoring individuals in good health capable of attending our testing sites, or bias favoring those with prior COVID-like illnesses seeking antibody confirmation are also possible. The overall effect of such biases is hard to ascertain.” So yes, you’re right, the study is not methodologically rigorous when it comes to recruitment strategy, but the bias can work both ways. Perhaps the results actually undercount the percentage of people infected, because it drew a bunch of uninfected…

kpkinsunnyphiladelphia
7 months ago

Oops!! Forgot the extra zero! That will teach me not to do arithmetic late at night. I meant 0.030% or around 100K+ deaths — including a pre-vaccine resurgence later in the year (though maybe it will be like H1N1, where we got a vaccine in 9 months). We’ll probably be south of 60K in this phase. So it will be worse than H1N1, but not 1918. And again, we’d have to stratify that CFR by age and age/co-morbidity combinations. For someone over 70 with COPD, your chances may be 50-50 if you catch it. The other thing I wanted to…

kpkinsunnyphiladelphia
7 months ago

Thanks for that thoughtful reply, I will look at all of those. If you’re right, then we are damned if we do and damned if we don’t. The virus has to run its course, now, or later, but eventually. Who will win the policy decision then? If it’s Zeke Emmanuel, with his “it’s going to take 18 months,” at one extreme, we will destroy economic society as we know it. And even if we delay, say, another two months, and then open up, the consequences may still be catastrophic from BOTH a societal/economic AND and epidemiological point of view. Or,…

GerryC
7 months ago

It’s possible to be sensitive to that colliding venn diagram, where you’re concerned about suppressing COVID-19, worried about the economic impacts, and concerned about government intervention/suppression/imposition. However, if we don’t suppress (r<<1) significantly first, what are the consequences of reopening, and seeing a large second wave of infections. Most of us on the epidemiological side of things are concerned about that second wave, especially in the more rural areas that haven't yet spiked. I fully anticipate we'll see areas reexert high case loads, and I wouldn't be surprised to see new epicenters of infection. What will be devastating is if…
