Round-up April 26 – May 28

I’ve started this round-up with recent papers focused on two scientific themes that will dominate near-term progress in understanding the links between genotype and phenotype: 1) trouble ahead for polygenic scores, and 2) the coming together of rare and common variation analysis.

Trouble for polygenic scores

Previous work on the genetics of human height, based on the GIANT consortium, had identified signatures of evolutionary adaptation to explain the North-South height gradient. But two new studies applying the same methodology to the larger and more homogeneous UK Biobank data found no such evidence. They did find that the same SNPs were identified, and with similar effect sizes, but population structure biases these effect sizes. This is particularly problematic in meta-analyses that combine heterogeneous data sources, and much worse if sub-significant SNPs are included. It calls for extreme caution when a) looking for signals of polygenic adaptation, or b) interpreting between-population differences. Additionally, this population structure can be “an additional source of error in polygenic scores and affect their applicability even within populations” (paper 1), as “even small differences in ancestry will be inadvertently translated into large differences in predicted phenotype” (paper 2). These results are nicely put in context here, with a quote from a former teacher of mine: “The methods developed so far really think about genetics and environment as separate and orthogonal, as independent factors. When in truth, they’re not independent. The environment has had a strong impact on the genetics, and it probably interacts with the genetics,” said Gil McVean, a statistical geneticist at the University of Oxford. “We don’t really do a good job of … understanding [that] interaction.”
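To make the “small differences in ancestry, large differences in predicted phenotype” point concrete, here is a toy sketch in Python. It is not the method of either paper. It assumes two hypothetical populations, made-up allele frequency differences and effect sizes, and a stratification bias that is only a tenth of the true effect but always points the same way as the frequency difference. All names and numbers are assumptions for illustration.

```python
def mean_score_difference(freq_diffs, effects):
    """Expected gap in mean polygenic score between two populations.

    Each person carries two alleles per SNP, so the expected difference in
    allele count at a SNP is 2 * (allele frequency difference).
    """
    return sum(2 * d * b for d, b in zip(freq_diffs, effects))


def simulate(n_snps, freq_diff=0.01, true_effect=0.02, bias=0.002):
    """Return (true_gap, biased_gap) for a toy two-population setup.

    All parameter values are hypothetical and chosen only for illustration.
    """
    # Frequency differences alternate in sign from SNP to SNP.
    freq_diffs = [freq_diff if j % 2 == 0 else -freq_diff
                  for j in range(n_snps)]
    # True effects flip sign every two SNPs, so they are unrelated to the
    # direction of the frequency difference and cancel out across SNPs.
    true_effects = [true_effect if (j // 2) % 2 == 0 else -true_effect
                    for j in range(n_snps)]
    # Stratification bias: a small error that always points the same way as
    # the frequency difference, so it accumulates instead of cancelling.
    estimated = [b + (bias if d > 0 else -bias)
                 for d, b in zip(freq_diffs, true_effects)]
    return (mean_score_difference(freq_diffs, true_effects),
            mean_score_difference(freq_diffs, estimated))


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        true_gap, biased_gap = simulate(n)
        print(f"{n:>9} SNPs: true gap {true_gap:.4f}, "
              f"biased gap {biased_gap:.4f}")
```

In this toy setup the true gap between the populations stays at zero, while the biased gap grows in proportion to the number of SNPs in the score. That is the intuition for why including many sub-significant SNPs makes things worse.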

Question: what would the same analysis show for the Educational Attainment polygenic score? It stands as the other score based on very large heterogeneous data, and it also utilizes many non-significant SNPs.

A separate preprint shows how, even within an ancestry group, porting polygenic scores has its challenges. “The prediction accuracy of polygenic scores depends on characteristics such as the age or sex composition of the individuals in which the GWAS and the prediction were conducted, and on the GWAS study design.”


Coming together of rare and common variation

We’re getting to the stage of having very large cohorts of NGS data. What will we learn of how common and rare variation jointly contribute to disease? And what implications does this have for the clinic?

One paper analyzes an exome cohort of over 20,000 T2D cases and 24,000 controls, one of the largest studies yet to use NGS data. For 76% of the cohort the authors also had array data plus imputation. The broad relevance of this type of study led me to read the paper fairly closely. What did they find?

  • Looking exome-wide, of the 6.3 million variants in their data set, 15 were exome-wide significant. They were powered to find variants with an effect size of OR 2.5 at a frequency of 0.2%.
  • They aggregated to the gene level and found 3 significant genes. Looking at the near misses in other datasets leads them to think that these will become exome-wide significant in the future. They estimated that the top 100 gene-level signals would capture a mere ~2% of the genetic variance in their sample.
  • Then they aggregate another level up, at the gene set level, only drawing the weak conclusion that this line of work “can be used as a potential metric to prioritize candidate genes relevant to T2D.”
  • They found almost all of the variants from the exome data in the array data as well (8 of the 10 single variants), and then 14 more non-coding variants in the array data. The vast majority of their overall variants were not imputable. Because the array data identified common variants, it explained more of the genetic liability.
  • The basic issue is that they continue to be underpowered to a) find rarer variants (<<0.2%), and b) accurately estimate their effect sizes. They suggest, as an antidote to (a), relying on prior suspicions of a gene-disease connection to narrow the search space (and hence lower the threshold for detecting significance).
  • They conclude that for research, GWAS are best for “locus discovery and fine mapping”, and NGS for gene characterization and confidence in gene-disease connections.
  • And for personalized medicine, the very rare variants of large effect sizes may be useful, but these are so rare as to complement (rather than replace) polygenic scores based on array data.
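The point above about narrowing the search space can be sketched with a plain Bonferroni correction. This is an illustration only: the paper may well use different thresholds, and the overall alpha of 0.05, the ~20,000-gene count, and the 50-gene candidate set are my assumptions. Only the 6.3 million variant count comes from the study as summarized above.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that keeps the family-wise error at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


if __name__ == "__main__":
    # 6.3 million variants is the figure quoted in the list above; the gene
    # count and the size of the candidate set are hypothetical.
    scenarios = {
        "all single variants": 6_300_000,
        "all genes (assumed ~20,000)": 20_000,
        "candidate genes with prior evidence (assumed 50)": 50,
    }
    for label, n_tests in scenarios.items():
        print(f"{label}: p < {bonferroni_threshold(n_tests):.2e}")
```

Fewer tests means a less stringent per-test threshold, so a rare variant in a gene with prior evidence needs far less statistical support to count as a hit than one found in an unbiased exome-wide scan.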



  • Large datasets such as the UK Biobank are showing that a lot of the candidate gene work was spurious. Here is a piece from Ed Yong focusing on SLC6A4 and its connection with depression, which about 450 papers investigated. Now many are claiming that there is no evidence that the connection exists (and indeed that this has been clear for years now). But some are saying that we know the effects of this gene depend on the environment, and the new studies do not measure the environment anywhere near as accurately as needed.
  • Sarah Zhang at the Atlantic points out another enduring legacy of some of the candidate gene work. A gene called MTHFR was associated with adverse results following smallpox vaccination in a small 2008 study. Just like other candidate genes, it hasn’t stood the test of time. But MTHFR is the single gene that 23andMe gets the most questions about, from anti-vaxxers hoping to find that their child has a variant that will get them a medical exemption from vaccination (to do this they have to download their raw data and upload it somewhere else).
  • In a preprint, Plomin et al. argue, on the basis of ~7000 twin pairs, for the existence of a substantially heritable (50-60%) p-factor: a polygenic general psychopathology factor.
  • A study in PNAS, with a write-up in the NY Times, locates several more cases where an extreme difference in smell perception can be linked to single SNPs.
  • A polygenic score for obesity, with those in the top decile 13kg heavier than those in the bottom decile by age 18.
  • A large (~30k cases, ~170k controls) GWAS of bipolar disorder identifies 30 genome-wide significant variants. One scary thing: their first analysis was of a subset of the data (20k cases, 31k controls), in which they found 19 variants, 8 of which did not replicate in the combined analysis.
  • Most people who are at a 50% risk of developing Huntington’s disease do not want to know if they carry the genetic variant. Why? A study based on data from 1999-2008 found the two biggest reasons were no effective cure/treatment (66%) and inability to undo knowledge (66%).





