Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated | Scientific Reports

nature.com

PCA did not produce correct and\or consistent results across all the design schemes, whether even-sampling was used or not, and whether for unmixed or admixed populations. We have shown that the distances between the samples are biased and can be easily manipulated to create the illusion of closely or distantly related populations. Whereas the clustering of populations between other populations in the scatter plot has been regarded as “decisive proof” or “very strong evidence” of their admixture, we demonstrated that such patterns are artifacts of the sampling scheme and meaningless for any bio historical purposes. Sample clustering, a subject that received much attention in the literature, e.g., Ref., is another artifact of the sampling scheme and likewise biologically meaningless (e.g., Figs. 12, 13, 14, 15), which is unsurprising if the distances are distorted. PCA violations of the true distances and clusters between samples limit its usability as a dimensional reduction tool for genetic analyses. Excepting PC1, where the distribution patterns may (e.g., Fig. 5a) or may not (e.g., Fig. 9) bear some geographical resemblance, most of the other PCs are mirages (e.g., Fig. 16). The axes of variation may also change unexpectedly when a few samples are added, altering the interpretation.

Specifically, in analyzing real populations, we showed that PCA could be used to generate contradictory results and lead to absurd conclusions (reductio ad absurdum), that “correct” conclusions cannot be derived without a priori knowledge and that cherry-picking or circular reasoning are always needed to interpret PCA results. This means that the difference between the a posteriori knowledge obtained from PCA and a priori knowledge rests solely on belief. The conflicting PCA outcomes shown here via over 200 figures demonstrate the high experimenter’s control over PCA’s outcome. By manipulating the choice of populations, sample sizes, and markers, experimenters can create multiple conflicting scenarios with real or imaginary historical interpretations, cherry-pick the one they like, and adopt circular reasoning to argue that PCA results support their explanation.

...

Indeed, after “exploring” 200 figures generated in this study, we obtained no a posteriori wisdom about the population structure of colors or human populations. We showed that the inferences that followed the standard interpretation in the literature were wrong. PCA is highly subjected to minor alterations in the allele frequencies (Fig. 12), study design (e.g., Fig. 9), or choice of markers (Fig. 22) (see also Refs.57,68). PCA results also cannot be reproduced (e.g., Fig. 13) unless an identical dataset is used, which defeats the usefulness of this tool. In that, our findings thereby join similar reports on PCA’s unexpected and arbitrary behavior. Note that variations in the implementations of PCA (e.g., PCA, singular value decomposition [SVD], and recursive PCA), as well as various flags, as implemented in EIGENSOFT, yield major differences in the results—none more biologically correct than the other. That the same mathematical procedure produces biologically conflicting and false results proves that bio historical inferences drawn only from PCA are fictitious.

I highly recommend reading the entire article. It is quite detailed. They do PCA analyses with a toy model using colors with admixture and show that the choice of inputs can place an admixed population (the color Black) arbitrarily close to any of its component populations (Blue, Green, or Red) on a scatter plot of their principal components. They also go through data sets from some other population genetics studies and show how those data sets can generate conflicting PCA results depending heavily on the researchers' choice of inputs.

Applying principal component analysis (PCA) to a dataset of four populations sampled evenly: the three primary colors (Red, Green, and Blue) and Black illustrate a near-ideal dimension reduction example.

Note: colours are represented as RGB, from 0 to 1 instead of 255.

Although PCA correctly positioned the primary colors at even distances from each other and Black, it distorted the distances between the primary colors and Black (from 1 in 3D space to 0.82 in 2D space).
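The 0.82 figure can be checked with a few lines of NumPy. This is only a minimal sketch, assuming one sample per colour (equal sampling) and plain SVD-based PCA on mean-centred RGB values; the paper's own pipeline and sample sizes may differ, and the variable names are mine.

```python
import numpy as np

# One sample per population (equal sampling); RGB values scaled to [0, 1].
colors = {
    "Red":   [1.0, 0.0, 0.0],
    "Green": [0.0, 1.0, 0.0],
    "Blue":  [0.0, 0.0, 1.0],
    "Black": [0.0, 0.0, 0.0],
}
names = list(colors)
X = np.array([colors[n] for n in names])

# PCA via SVD of the mean-centred data; rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_2d = Xc @ Vt[:2].T  # coordinates on PC1 and PC2


def dist(points, a, b):
    """Euclidean distance between two named populations."""
    return float(np.linalg.norm(points[names.index(a)] - points[names.index(b)]))


d3_black_red = dist(X, "Black", "Red")          # distance in full RGB space
d2_black_red = dist(scores_2d, "Black", "Red")  # distance on the PC1-PC2 plane
d2_red_green = dist(scores_2d, "Red", "Green")  # primary-to-primary distance in 2D

print(round(d3_black_red, 2), round(d2_black_red, 2), round(d2_red_green, 2))
```

With these inputs the two leading PCs span the plane perpendicular to the grey axis (1, 1, 1), so the primaries keep their mutual distance of sqrt(2) while Black is pushed towards the middle of the triangle.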

No shit. What this means in terms of genetics is that if you have 3 source populations A, B, and C, where A and B are relatively genetically similar, say they differ on only 100 alleles, and C is very different from them, differing on 1000 alleles, and then you do PCA on various populations that are mixed out of those, the PCA plot won't tell you that A and B are similar. It will only tell you the relative admixtures of A, B, and C in the sampled populations. Of course, this is often exactly what you want.

It's an especially bizarre complaint since the "allele distance" can already be calculated from the raw data, and represented easily, without doing PCA or anything of that sort.
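As a rough illustration of computing that distance directly, here is a sketch that counts differing sites between three made-up genotype vectors. The populations A, B, and C, the 2000-site length, and the 0/1 coding are all hypothetical, chosen only to match the 100 vs 1000 numbers above.

```python
import numpy as np

n_sites = 2000  # hypothetical number of biallelic sites

# Hypothetical reference genotypes, coded 0/1 per site.
A = np.zeros(n_sites, dtype=int)
B = A.copy()
B[:100] = 1       # B differs from A at 100 sites
C = A.copy()
C[100:1100] = 1   # C differs from A at 1000 other sites

pops = np.stack([A, B, C])

# Pairwise number of differing sites (a Hamming distance matrix).
D = (pops[:, None, :] != pops[None, :, :]).sum(axis=2)
print(D)
```

The resulting 3x3 matrix can be read off or plotted as is, with no dimension reduction involved.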

Box 1: Studying the origin of Black using the primary colors

Same dataset as before, except they change the relative sampling frequency of the colours, and show that this can change which primary colour black is closest to.

The problem here is that the author seems to think that black "should" come out as an even mixture, but it isn't one. Genetic mixture is weighted averaging, and an even mixture would be (1/3, 1/3, 1/3), a darkish grey. Black can't be any kind of mixture of the others. In all these plots you see a pyramid rotated to various angles. This is not a "near-ideal dimension reduction"; that would look flatter in 3D. In fact you could also make it so that black is outside the triangle of the other 3, don't know why he doesn't show that.
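The weighted-averaging point can be written out explicitly. A minimal sketch, assuming admixture is modelled as a convex combination of the three primaries (non-negative weights summing to 1); the helper name `mix` is mine:

```python
import numpy as np

# Source populations: Red, Green, Blue as RGB rows.
sources = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])


def mix(weights):
    """Weighted average of the sources; weights must be >= 0 and sum to 1."""
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("admixture weights must be non-negative and sum to 1")
    return w @ sources


even = mix([1 / 3, 1 / 3, 1 / 3])  # the even mixture, a darkish grey

# Every mixture of the primaries has channels that sum to 1,
# whereas Black (0, 0, 0) sums to 0, so no choice of weights reaches it.
rng = np.random.default_rng(0)
random_mixes = rng.dirichlet(np.ones(3), size=1000) @ sources
channel_sums = random_mixes.sum(axis=1)
```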

What this tells us genetically is that a plot of n PCs can only depict the relative admixture of n+1 source populations. If there are more, you'll only see the n+1 most important to your sample (ideally, if the others don't matter very much), or rotations between them (less ideally). So you do often need to look at more than 2 PCs, but if the samples are actually admixed from a few sources, you won't need very many.
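The n+1 rule is easy to see numerically in an idealised setting. A sketch, assuming noise-free samples that are exact weighted averages of k made-up source allele-frequency vectors (real genotype data adds sampling noise on top, so the cutoff is less sharp there):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_sites, n_samples = 4, 500, 200  # hypothetical sizes

# Hypothetical allele frequencies for k source populations.
sources = rng.uniform(0.0, 1.0, size=(k, n_sites))

# Each sample is an exact mixture: non-negative weights summing to 1.
weights = rng.dirichlet(np.ones(k), size=n_samples)
X = weights @ sources

# Singular values of the centred data give the variance carried by each PC.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)

# Count PCs whose singular value is not numerically zero.
n_informative = int((s > 1e-10 * s[0]).sum())
print(n_informative)
```

With k = 4 sources only 3 PCs carry any variance, because mixtures of k points lie in a (k-1)-dimensional simplex.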

Also notice that the shift in relative sample size needed to create these changes in the example is multiple orders of magnitude, and the more "reducible" the data is, the more drastic the shifts need to be.
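The sample-size effect can be reproduced with the same four colours. This is a sketch with sample counts I picked for illustration (1000:1, not the paper's actual schemes); the function name `black_distances` is mine.

```python
import numpy as np

PRIMARIES = {"Red": [1, 0, 0], "Green": [0, 1, 0], "Blue": [0, 0, 1]}
BLACK = [0, 0, 0]


def black_distances(counts):
    """Distance from Black to each primary on PC1-PC2 for given sample counts."""
    names = list(PRIMARIES) + ["Black"]
    points = np.array(list(PRIMARIES.values()) + [BLACK], dtype=float)
    # Repeat each colour as many times as it was "sampled".
    X = np.repeat(points, [counts[n] for n in names], axis=0)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    proj = (points - mean) @ Vt[:2].T  # one projected point per colour
    return {n: float(np.linalg.norm(proj[i] - proj[3]))
            for i, n in enumerate(names[:3])}


even = black_distances({"Red": 1000, "Green": 1000, "Blue": 1000, "Black": 1000})
skewed = black_distances({"Red": 1, "Green": 1000, "Blue": 1000, "Black": 1000})
print(even)
print(skewed)
```

With even counts Black sits equally far from all three primaries. With Red undersampled by three orders of magnitude, the first two PCs are fitted almost entirely to Green, Blue, and Black, and Red lands nearly on top of Black.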

Based on this we get:

Reich et al.44 presented further PCA-based “evidence” to the ‘out of Africa’ scenario. Applying PCA to Africans and non-Africans, they reported that non-Africans cluster together at the center of African populations when PC1 was plotted against PC4 and that this “rough cluster[ing]” of non-Africans is “about what would be expected if all non-African populations were founded by a single dispersal ‘out of Africa.’” However, observing PC1 and PC4 for Supplementary Fig. S3, we found no “rough cluster” of non-Africans at the center of Africans, contrary to Reich et al.’s44 claim... This is an example of how vital a priori knowledge is to PCA.

You could also tell a priori knowledge was used because you're shown PC1 and PC4. Someone looked at more than that and chose to show you these. This is also why it's dumb to show you PC1 and PC4 from his replications with changed relative sample sizes: if you looked at more PCs (and probably not many more) you'd find a combination that replicates Reich's plot or something similar, because again you're just rotating your perspective on the same shape. (Or not, see below.)

Also, if you read the Reich cite (I recommend it if you have access (arrr), it's only 2 pages), you'll see it's primarily about integrating PCA with other sources of information, and that plot is an example. Also also, he seems to have done the PCA on Africans only, and then plotted the others into those PCs, whereas the author is doing PCA on all of them.
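For what the difference between those two procedures looks like in code, here is a generic sketch on random stand-in data. It is not Reich et al.'s pipeline or data; the arrays `reference` and `others` and both helper functions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
reference = rng.normal(size=(50, 20))        # stand-in for the group the PCs are fitted on
others = rng.normal(loc=0.5, size=(30, 20))  # stand-in for the samples projected afterwards


def fit_pcs(X, n_pcs):
    """Return the mean and the first n_pcs principal axes of X."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_pcs]


def project(X, mean, axes):
    """Project samples onto previously fitted axes, using the fitted mean."""
    return (X - mean) @ axes.T


# Procedure 1: fit on the reference group only, then project everyone.
mean, axes = fit_pcs(reference, n_pcs=4)
ref_scores = project(reference, mean, axes)
other_scores = project(others, mean, axes)

# Procedure 2: fit on all samples together.
joint_mean, joint_axes = fit_pcs(np.vstack([reference, others]), n_pcs=4)
joint_scores = project(np.vstack([reference, others]), joint_mean, joint_axes)
```

In the first procedure the projected samples have no influence on the axes; in the second they do, so the two plots are not expected to match.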

Box 2: Studying the origin of Black using the primary and one secondary (admixed) color populations

Following criticism on the sampling scheme used to study the origin of Black (Box 1), the redoubtable Black-is-Red group genotyped Cyan.

The RGB for Cyan is (0, 1, 1), which is again not a mixture of the previous colours. Black, Blue, Green, and Cyan might be mixtures of 3 other colours that lie outside the RGB space, but that's still 4 source populations.

Pretty much the same goes as I said for Box 1.

This is where I stopped reading.

Yeah, PCA on its own is not useful for conclusions in genetics. It's a good guide to telling you what to investigate further but it stops there. F-statistics are very strongly related to PCA and don't have this bias mentioned here. They're far more useful for actual reproducible conclusions.

They do PCA analyses

The linked paper has one author, who is a man.

Honestly I’m staggered (or maybe not) that the editors didn’t cut out all the emotive language that makes it sound like he’s in a Twitter argument (using the term ‘PCA disciples’) and the fact he finished off the whole paper like an advert for his crappy method called GPS. Like he literally only recommended using his methods at the end. It’s kinda juvenile and embarrassing. Most of his conclusions are well known, nobody relies only on PCA; he’s just constructed a massive straw man at every point to try and justify it. It honestly sounds like he has some beef with David Reich or something and wanted to write a hit piece against him.

Of course this guy is a well known crank, pushing for a Khazar origin of Ashkenazi Jews and considers the Yiddish language as a Slavic-Turkish creole language or whatnot, so I’m barely surprised. It’s a shame though because there is room for a paper to critically evaluate the use of PCA in pop gen, just ….. this really isn’t it.

On Twitter, Graham Coop pointed out that Gil McVean published an analysis of PCA in 2009 where he discussed similar issues. If you want to know more about the caveats of PCA, ignore this Scientific Reports paper and read Gil's work; that is a whole other level.

the Yiddish language as a Slavic-Turkish creole language

What? How does that make sense with Yiddish having so many German words?

A hostile article about Mr. Elhaik's article, the actual article co-authored by Mr. Elhaik: PDF