GWAS in 2022

The largest GWAS of all time (of all time!) dropped a few weeks ago to little fanfare, at least in these spaces. In a nutshell: 5.4 million participants with measured height and 1.4 million SNPs per participant, so about 7.5 trillion data points if I’m not mistaken. If you submitted 23andMe samples, congratulations! You contributed to the (current) record holder for largest GWAS in history. In total, the study accounts for 40-45% of the phenotypic variance of height, and furthermore, the authors claim this is saturating: adding more samples won’t increase the fraction of heritability that they can account for.
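A quick sanity check on that data-point count, using only the two rounded figures quoted above (so the result is only as good as the rounding):

```python
# Back-of-the-envelope size of the genotype matrix described in the post.
participants = 5_400_000      # ~5.4 million people (figure from the post)
snps_per_person = 1_400_000   # ~1.4 million SNPs each (figure from the post)

data_points = participants * snps_per_person
print(f"about {data_points / 1e12:.2f} trillion genotype calls")
```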

What you can do with this data:

  1. Generate some robust polygenic scores (PGS)

  2. ‘Risk prediction’ if you have a burning desire to know how tall someone will be (with large error bars)

  3. ???
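For anyone unfamiliar with what a polygenic score actually is: at its simplest it is a weighted sum of allele counts, with the weights taken from the GWAS effect sizes. Here is a minimal sketch of that idea. The SNP names, effect sizes, and the `polygenic_score` helper are all made up for illustration and are not taken from the paper; real pipelines also handle LD, strand flips, and missing genotypes far more carefully.

```python
# Toy polygenic score: a weighted sum of allele dosages.
# All SNP IDs and effect sizes below are hypothetical, for illustration only.

# Hypothetical per-allele effect sizes (arbitrary units per copy of the effect allele)
effect_sizes = {
    "snp_a": 0.30,
    "snp_b": -0.15,
    "snp_c": 0.05,
}

def polygenic_score(dosages, weights):
    """Sum of (copies of effect allele) * (effect size) over the scored SNPs.

    SNPs that are missing from `dosages` are skipped, i.e. they contribute 0.
    """
    return sum(weights[snp] * dosages[snp] for snp in weights if snp in dosages)

# Each dosage is 0, 1, or 2 copies of the effect allele.
person = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
score = polygenic_score(person, effect_sizes)
print(f"PGS: {score:+.2f}")
```

The score on its own is just a number; it only becomes a height prediction once it is calibrated against a reference population, which is where the large error bars come in.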

What you can’t do with this data:

  1. Understand the phenomenon of ‘height’ in any meaningful way

  2. Genetic engineering à la Oryx and Crake, which is how most people imagine CRISPR being used to make designer babies.

  3. Develop any kind of treatment or therapeutic that would improve the human condition.

So, to put it in some context: the criticism of GWAS has always been that these studies are large and expensive, rarely teach us anything about the underlying biology, and explain little of the actual heritability (the ‘missing heritability’ problem). The ‘mechanistic’ biologists interested in curing disease or engineering biology generally dislike GWAS. It’s interesting in the way that astrobiology is interesting: good to know that planet XYZ792 150 light years away may have liquid water on its surface, but not really of practical use. What they (and I, being very much of this pedigree) missed is that PGS are of use if you’re in the business of embryo selection; I was corrected on that point a few years ago (conversation here if you want to see me being wrong). So if your goal is having really tall (or short!) children, this paper is good news for you, but you’ll probably still be dissatisfied with the current low throughput of embryo selection.

That being said, these criticisms are still salient and, to some extent, I think have been validated: saturating the SNP space with an absurd number of samples (for context: there are only 1.5 million Americans with type 1 diabetes! Good luck saturating that GWAS in our lifetime) only explains 45% of the variance, and this number will undoubtedly vary from trait to trait. Presumably the rest is coming from rare variants (this study excludes variants with a minor allele frequency (MAF) below 1%, which is quite a high cutoff), structural variants, or some genetic dark matter, the last of which would imply that our heritability estimates are too high or that heritability isn’t being driven entirely by DNA (?).
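To make "explains 45% of the variance" concrete, here is a small simulation sketch. It assumes height is the sum of three independent pieces: a part a common-SNP score can capture, a part from other genetic sources, and a non-genetic part. Only the 45% figure comes from the discussion above; the 35/20 split of the remainder is my own illustrative assumption, not a number from the paper. Variance explained is then just the squared correlation between the score and the trait.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Assumed variance shares. Only VAR_COMMON echoes the post; the rest are illustrative.
VAR_COMMON = 0.45   # captured by a common-SNP polygenic score
VAR_OTHER = 0.35    # other genetic sources (rare or structural variants, etc.)
VAR_ENV = 0.20      # everything non-genetic

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

n = 5000
pgs = [random.gauss(0, VAR_COMMON ** 0.5) for _ in range(n)]
other = [random.gauss(0, VAR_OTHER ** 0.5) for _ in range(n)]
env = [random.gauss(0, VAR_ENV ** 0.5) for _ in range(n)]
height = [p + o + e for p, o, e in zip(pgs, other, env)]

r2 = r_squared(pgs, height)
print(f"variance explained by the score: {r2:.2f}")
```

Under these assumptions the score lands near 0.45 however large `n` gets, because the other two components never enter the score at all. That is the toy version of "saturating".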

I think this also has something to say about the omnigenic model. Even with a very high-powered study most of the SNPs are still clustering around genes with known functions related to growth, bone structure, etc. About a third aren’t near anything at all and we have no idea what they might be doing. But again, the low heritability explained would argue that rare variants may play a much larger role than previously appreciated, which may hew closer to Jim Lupski’s Clan Genomics model. And, this is much more speculative, but perhaps this is hinting at the biological underpinnings of ‘interindividual variation is larger than population level differences,’ i.e., rare variants (and the rarer end of SNPs) unique to your ‘clan’ have a similar or larger effect size than the very common SNPs shared by populations. Eager to see what people think or if they have any corrections.

By the way, how does one use superscripts around these parts? Would have been useful to clean up some of these asides with footnotes. Also, how to use tilde without getting strikethrough?

The development of big GWAS and tools like AlphaFold suggests to me that we’re nearing the point where useful empirical information overwhelms the capacities of human comprehension.

Exactly! I enjoyed this essay quite a bit. Maybe our fate was never to truly understand biology, but build an oracle that can.

A lot of the work of medicine has been outsourced to evolution, and we’ve cribbed from her notes on every antibiotic and biologic we’ve produced. But we’re getting close to the point where we can build magic bullets from first principles.

It's an interesting question. Perhaps the antibiotic discovery space has been completely saturated by Nature already, at least in terms of targets. In the late 2000s, we developed fully synthetic antibiotics never before seen in nature and bacteria developed resistance just the same. I wonder if the future will be more medicinal chemistry tricks or a pivot to something like bacteriophages...

In terms of biologics, are you referring to monoclonal antibodies? If I'm interpreting you correctly, one day having to raise the right antibody to your target will be trivial because you'll just feed the sequence into AlphaFold and you're done. I agree, the first person with a model capable of that is going to mint money for a while. There are still a host of other very difficult problems to be solved even at that point though; these kinds of models are only going to get us so far.

Biologics are a big category, of which the -mabs are the early success story. The next evolution would be exactly what you described, where we can construct a protein to block targets by way of a fancy ML chemistry algorithm instead of trial and error. Beyond that, we get into de novo synthetic proteins that have more in common with sci-fi nanomachines than penicillin. Then, ???.