Coronavirus: up until May 1

A vast amount of research is under way to understand how human genetics interacts with the new coronavirus.

What follows is shaped by an ever-growing interest of mine in my work in bioethics: the significance of individual biological difference, how we let it influence our lives, and the difficulties that stem from using the population concept. I have flagged below the all-too-easily-drawn distinctions between populations that are appearing in the literature.

Humans react very differently to infection with SARS-CoV-2. Some of this is because of genetic differences. Characterizing these differences could help us understand disease processes: severity, outcomes, and how the immune system interacts with the virus. This could lead to better disease management, possibly by suggesting therapeutic approaches. It could also help us understand patterns of infection, which may inform vaccine development. These use cases are the biggest motivation. They don’t rely on any of us learning our own relevant genetic information.

Understanding what makes some people more or less susceptible to severe disease could also be used directly. There is precedent: genetic information is often used in the prescription of pegylated interferon α and ribavirin for chronic Hep C infection. Finding who is least susceptible could be useful for clinical trial design, particularly if we opt to purposefully infect research subjects in a challenge trial design. Finding who is most susceptible could help identify who should take extra precautions. Recognizing that a diverse array of genotypes can have large impacts could ensure that individuals with all those genotypes are represented in therapy trials and vaccine trials, so we know that results generalize.

Before we turn to human genomics, what do we know about the virus’s genome?

Before SARS-CoV-2, 6 human coronaviruses were known: SARS, which killed 774 of 8096 infected in 2002-3; MERS, which killed ~600 of ~2000 infected starting in 2012; and four others that cause milder symptoms, which collectively account for a third of all colds. There are also hundreds of coronaviruses that infect other animals, notably bats. In 2013, a bat virus very close to SARS was identified, suggesting direct transmission from bats.

The first genome of the new coronavirus was published on January 10: ~30,000 bases, 14 ORFs encoding 27 proteins. The NYT produced a beautiful tour of the viral genome.

On 21 January, a group posted to bioRxiv a theory that the virus may have been engineered in a lab. Others pointed out mistakes in the analysis, and it was quickly withdrawn.

The first report of viral sequences from nine Wuhan patients showed 99.8% sequence identity to each other, and hence a recent common origin. It also showed sufficient divergence from SARS-CoV to be classified as a different virus. 
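
To make "sequence identity" concrete, here is a minimal sketch of how a percent-identity figure can be computed from two sequences that have already been aligned. The function name and the toy sequences are illustrative assumptions, not taken from the report, and real pipelines handle ambiguous bases and alignment ends more carefully.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of compared alignment columns at which two sequences agree.

    Assumes both sequences come from the same alignment (equal length,
    gaps written as '-'). Columns with a gap in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # ignore alignment gaps
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


# Toy example (hypothetical sequences, far shorter than a ~30,000-base genome)
genome_1 = "ATGGTTCAAGCT"
genome_2 = "ATGGTTCAGGCT"
print(f"{percent_identity(genome_1, genome_2):.2f}% identical")
```

Two samples that differ at only a handful of positions across the whole genome score very close to 100%, which is what points to a recent common origin.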

Based on homology to SARS-CoV, SARS-CoV-2 was predicted to also use the human protein ACE2 to enter cells. The receptor binding domain on the spike protein has a strong effect on infectivity. (It is known which mutations might lead to greater infectivity; it is hoped these will not be selected for.)

Multiple sequence alignment has been used to trace how the virus is spreading. This is a key technique in “precision public health”, which relies on slight changes between pathogen genomes to trace how an infection spreads through populations. See Nextstrain, a global effort with lots of beautiful graphics, and a recent (non-COVID) review of pathogen genomics. The CDC has been a late adopter compared to e.g. the UK, but there are several very large scale efforts using next generation sequencing for pathogen tracking in the US. The UK announced a £20m effort to track the pathogen using viral genomics. Another great graphical piece from the NYT shows how the virus has gained mutations, and how this lets us track it.
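
The idea of tracing spread from slight genome changes can be sketched in a few lines. This is a deliberately simplified illustration, not how Nextstrain or any public health lab actually does it (they build phylogenetic trees with dedicated tools). The sample names, sequences, and helper functions below are all hypothetical. Each sample is reduced to its set of substitutions against a reference, and a sample whose mutations are a strict subset of another's is treated as a plausible ancestor.

```python
from typing import Dict, Optional, Set


def substitutions(reference: str, sample: str) -> Set[str]:
    """Substitutions relative to the reference, e.g. 'A5T' (1-based position)."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    found = set()
    for pos, (ref_base, base) in enumerate(zip(reference, sample), start=1):
        # Skip gaps and unknown bases; count only clear base changes.
        if ref_base != base and "-" not in (ref_base, base) and "N" not in (ref_base, base):
            found.add(f"{ref_base}{pos}{base}")
    return found


def closest_ancestor(name: str, profiles: Dict[str, Set[str]]) -> Optional[str]:
    """Sample whose mutations are the largest strict subset of `name`'s mutations."""
    own = profiles[name]
    candidates = [
        (len(muts), other)
        for other, muts in profiles.items()
        if other != name and muts < own  # strict subset
    ]
    return max(candidates)[1] if candidates else None


# Hypothetical aligned toy genomes
reference = "ACGTACGTAC"
samples = {
    "sample_A": "ACGTACGTAC",  # identical to the reference
    "sample_B": "ACGTTCGTAC",  # carries A5T
    "sample_C": "ACGTTCGTAG",  # carries A5T and C10G
}
profiles = {name: substitutions(reference, seq) for name, seq in samples.items()}

for name in samples:
    print(name, sorted(profiles[name]), "<-", closest_ancestor(name, profiles))
```

Because sample_C carries everything sample_B carries plus one extra change, the sketch places it downstream of sample_B. Accumulated mutations act as a breadcrumb trail in the same way at real scale.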

An April 29 preprint showed two clades, six subclades, and some evidence of convergent evolution to affect how strongly the spike protein binds to ACE2. Mutations in some subclades could evade some current tests.

(Of interest: As reported by MIT Technology Review, 20 years ago it was demonstrated that, starting with viral DNA, a virus would “reboot” in a cell. That meant that as soon as the viral genome was published, any lab with access to something that could “print” DNA, and that had the know-how to go from there, could make the virus (and any variant thereof). Only a few places can print DNA. They can choose whether to fulfill an order. Part of the process is that they compare incoming orders with a database of known pathogens (e.g. polio), to give one layer of control.)

Host genomics

COVID-19, the disease that can result from infection with SARS-CoV-2, is a product of the virus, the host, and the environment. The genome of the host can affect the impact of a pathogen, with both rare and common human variation known to play a role. The most famous example is the allele for sickle cell anemia: heterozygotes are protected from malaria. Another famous example is CCR5‐Δ32 which confers resistance to some HIV strains (this was the edit made to the first genetically engineered humans, born in 2018). 

A 2017 review stated “Despite their limited application in the field, GWASs [genome wide association studies] have provided valuable insights by pinpointing associations to both innate and adaptive immune response loci, as well as novel unexpected risk factors for infection susceptibility.” It also stresses that heterogeneity across populations is particularly important in the setting of host response to infectious disease. Things also get complicated by interactions between viral strain and host genotype. We have evidence of this in Hep C.

A 2016 study titled “Genetic Ancestry and Natural Selection Drive Population Differences in Immune Responses to Pathogens” showed that, when exposed to listeria and salmonella, many genes showed differential expression in white blood cells between individuals of European ancestry and individuals of African ancestry. They traced these differences to genetic variants, and further argue that the differences between populations are due to recent natural selection.

How important are genetic differences in understanding the differences in how SARS-CoV-2 affects individuals? An early estimate of the heritability of susceptibility to SARS-CoV-2 infection puts it at about 50%. This is hard to estimate as there are so many confounders, for example poverty.
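
The text above does not say how the ~50% figure was derived. One classical way to get a heritability estimate is a twin design using Falconer's formulas, so the sketch below assumes that approach purely for illustration. The twin correlations are made-up numbers, and the function name is hypothetical.

```python
def falconer_estimates(r_mz: float, r_dz: float) -> dict:
    """Split trait variance using Falconer's formulas.

    r_mz: trait correlation between identical (monozygotic) twins
    r_dz: trait correlation between fraternal (dizygotic) twins
    Assumes both twin types share their environment to the same degree.
    """
    h2 = 2.0 * (r_mz - r_dz)  # heritability (additive genetic share)
    c2 = 2.0 * r_dz - r_mz    # shared environment (e.g. household, poverty)
    e2 = 1.0 - r_mz           # unique environment plus measurement error
    return {"h2": h2, "c2": c2, "e2": e2}


# Hypothetical correlations, chosen only to show the arithmetic
estimates = falconer_estimates(r_mz=0.60, r_dz=0.35)
for component, value in estimates.items():
    print(f"{component} = {value:.2f}")
```

The equal-environments assumption is where confounders bite. If identical twins also share exposure more closely than fraternal twins do, the formula credits genetics with variance that is really environmental.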

The standard tools for identifying which genetic variation makes a difference are genome wide association studies. A preprint shows that early variants that show up as genome wide significant in the UK are associated with higher educational attainment and better health, and hence probably capture who was traveling, rather than anything about disease biology.
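
At its simplest, a genome wide association study repeats one small test at every variant: are allele counts different between cases and controls? Here is a minimal sketch of that single-variant step, using a chi-square test on a 2x2 table of allele counts. The counts are hypothetical, and the 5e-8 threshold is the conventional genome-wide cutoff rather than something stated above. Real analyses use regression with covariates.

```python
import math

GENOME_WIDE_ALPHA = 5e-8  # conventional threshold (an assumption, not from the text)


def allelic_chi2(case_alt: int, case_ref: int, ctrl_alt: int, ctrl_ref: int):
    """Chi-square test (1 degree of freedom) on allele counts; returns (chi2, p)."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts: (case_alt, case_ref, ctrl_alt, ctrl_ref)
variants = {
    "variant_1": (600, 400, 400, 600),
    "variant_2": (505, 495, 500, 500),
}
for name, counts in variants.items():
    chi2, p = allelic_chi2(*counts)
    flag = "genome-wide significant" if p < GENOME_WIDE_ALPHA else "not significant"
    print(f"{name}: chi2={chi2:.2f}, p={p:.3g} ({flag})")
```

A test like this only detects association. If carriers of an allele were simply more likely to have traveled early in the outbreak, it would flag the variant just as readily, which is the kind of confounding the preprint describes.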

Which genes are known to be involved in how SARS-CoV-2 affects the body? 

What human variation has already been identified that makes a difference for COVID-19?

  • ACE2
  • The Interferon Lambda Region has a role in controlling the expression of ACE2. Polymorphisms in the region have been linked to ACE2 expression levels in diseased tissues (preprint from an Oxford group). Variation in this region has previously been associated with Hep C infection outcomes (but with the effect varying by virus genotype). The protective allele is found more frequently in East Asians, and its absence more frequently in those of African ancestry. The authors conclude “the overall impact of this polymorphism on the clinical course should be assessed, especially given the very variable distribution of IFNL4 alleles in different ethnic groups”.
  • TMPRSS2
    • A preprint from Italy found different haplotypes between East Asians and Italians, with two rare alleles of interest predicted to induce higher levels of TMPRSS2 in Italians: one suggesting possible regulation through androgens (and hence possibly linked to sex differences), and the other already linked to increased susceptibility to flu.
    • A preprint from the LungMap consortium found evidence of increased expression of TMPRSS2 with aging, and identified a regulatory SNP that contributes to expression levels.
  • Interferon-induced transmembrane protein 3 (IFITM3) – a variant linked to more severe disease. The study from China compared mild to severe cases and found the homozygous variant rs12252 was much more common in the severe cases (p = 0.00093; OR = 6.37). The variant was previously associated with flu severity, and is found at much higher rates in those of East Asian ancestry (e.g. carried by ~26% of the Beijing population). 
  • HLA proteins. A class of human leukocyte antigen (HLA) proteins sits at the cell surface, presenting short sections of protein (peptides) for recognition by T-cells. If the T-cells recognize non-self, they react appropriately. Which peptides are shown? That depends on what’s in the cell: if a virus is present, they can include sections of virus proteins. And it depends on the precise structure of the HLA protein’s binding groove, which is hyper-variable between humans. The extent to which an individual’s HLA proteins have binding grooves that bind bits of the viral proteins could therefore affect the body’s immune response.
  • Blood groups (A more susceptible, O less so)
  • Meanwhile, polygenic scores for disease severity have already been produced
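
For readers unfamiliar with the statistics quoted in the IFITM3 item, the sketch below shows how an odds ratio and an exact p-value are obtained from a 2x2 table of genotype against severity. The counts are hypothetical and are not the study's data, so the output will not match the OR and p-value quoted above. The function names are illustrative.

```python
import math


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio for a 2x2 table.

    a: severe cases with the genotype    b: severe cases without it
    c: mild cases with the genotype      d: mild cases without it
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value: sum of all tables as or less likely than observed."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_probability(x: int) -> float:
        # Hypergeometric probability of x carriers among the severe cases
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(total, col1)

    observed = table_probability(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        table_probability(x)
        for x in range(lowest, highest + 1)
        if table_probability(x) <= observed * (1 + 1e-9)  # tolerance for float ties
    )
    return min(1.0, p_value)


# Hypothetical counts: 10 of 30 severe cases and 5 of 45 mild cases carry the genotype
a, b, c, d = 10, 20, 5, 40
print(f"OR = {odds_ratio(a, b, c, d):.2f}")
print(f"p  = {fisher_exact_two_sided(a, b, c, d):.4f}")
```

An odds ratio well above 1 says the genotype is over-represented among severe cases. With small samples, the interval around it can be very wide, which is one reason early findings like these need replication.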

Several large scale studies are starting

There have also been some commitments to the ideal of data sharing


  • Some of these references are to preprints, which are not peer reviewed. Others have been rushed through printing. I am pretty confident that not all of this will stand the test of time.
  • So much is standardized in the genomics workflows, and so much of the data is publicly available, that it is very easy to put out papers that look vaguely sensible. Here is a preprint looking at ancestry differences, which I remain very unconvinced by.