Phylogenomic studies are nowadays based on entire genomes or genome-scale data and hence on an unprecedented amount of data, but this does not guarantee that the tree of life is reconstructed without problems. Frequently, systematic errors in the dataset are amplified, which increases the support for incorrect groupings (i.e., significant false positives). These systematic errors arise from several factors, such as elevated mutation rates in some animal groups or genes but not in others, saturation, compositional biases in the sequences between taxa, or the degree of missing data. Modern phylogenomic studies address these biases and try to minimize their negative impact on the reconstruction of the phylogeny in a bias-by-bias manner.
This bias-by-bias approach has resolved some uncertainties, yet it is not successful in every case. For some questions (e.g., the relationships of comb jellies and sponges to the other animals), it does not always reconstruct the same phylogeny. Hence, it becomes subjective which bias is regarded as more important and consequently which phylogeny is to be trusted. This is not a satisfying scientific situation, and a proper methodology to resolve such conflicts objectively is required. Recent views suggest that the aforementioned confounding factors are not entirely independent of each other but partially co-vary. However, this correlation is not necessarily positive and linear, and it is sometimes counterintuitive. For example, addressing one bias in one part of the tree might strengthen biases in another part of the tree. One consequence is that different strategies are required, ones that address the biases simultaneously across the tree. The problem is thus a multifactorial optimization problem.
Aim of this Master project
The major aim of this Master project is to address the aforementioned problems by establishing a new approach. Instead of treating each bias independently and consecutively (one after another), all biases will be assessed at the same time using multivariate statistics such as principal component analysis (PCA). These methods allow all biases, including their covariation, to be assessed simultaneously. Different analytical strategies will be compared in the course of the project, for example using raw measurements of gene statistics versus specific measurements of biases, or hierarchical clustering approaches with different threshold values.
The project will have three parts. First, the student will test the different analytical strategies on simulated phylogenomic datasets with known properties in order to assess the performance of each strategy. Second, the student will test the strategies on empirical datasets with known problems that have been successfully addressed in previous studies. Third, the student will apply the procedures to empirical datasets for which the results are still debated, such as the position of the comb jellies and sponges.
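To illustrate what assessing all biases together could look like, here is a minimal sketch of a PCA on per-gene bias measurements. Everything in it is an assumption made for illustration: the gene names, the four bias columns, the numerical values, and the helper functions are hypothetical and not part of the project's actual pipeline. It uses only the Python standard library and extracts just the first principal component by power iteration; in practice a numerical library would likely be used instead.

```python
import math
import statistics

# Hypothetical per-gene bias measurements (illustrative values, not real data).
# Columns: rate heterogeneity, saturation, compositional bias, missing data.
BIAS_NAMES = ["rate", "saturation", "composition", "missing"]
GENES = {
    "gene_A": [0.10, 0.15, 0.30, 0.05],
    "gene_B": [0.20, 0.25, 0.28, 0.10],
    "gene_C": [0.35, 0.40, 0.20, 0.12],
    "gene_D": [0.50, 0.55, 0.15, 0.30],
    "gene_E": [0.70, 0.65, 0.10, 0.25],
    "gene_F": [0.90, 0.85, 0.05, 0.40],
}


def standardize(rows):
    """Z-score each column so that biases on different scales are comparable."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized data (sample divisor n - 1)."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(matrix, iterations=500):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    p = len(matrix)
    vec = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    # Rayleigh quotient of the unit-length vector gives the eigenvalue.
    eigenvalue = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(p))
                     for i in range(p))
    return vec, eigenvalue


z_scores = standardize(list(GENES.values()))
corr = correlation_matrix(z_scores)
loadings, eigenvalue = first_component(corr)

# The trace of a correlation matrix equals the number of variables,
# so this ratio is the share of total variance captured by the first component.
explained = eigenvalue / len(BIAS_NAMES)

# Score of each gene on the first component (projection of its z-scores).
scores = {gene: sum(l * v for l, v in zip(loadings, row))
          for gene, row in zip(GENES, z_scores)}

print("explained variance:", round(explained, 3))
for name, loading in zip(BIAS_NAMES, loadings):
    print(f"loading {name}: {loading:+.3f}")
for gene, score in scores.items():
    print(f"score {gene}: {score:+.3f}")
```

In this toy dataset the compositional bias decreases as the other three biases increase, so its loading has the opposite sign. This is the kind of non-positive covariation that a bias-by-bias treatment would miss. Genes with extreme scores on a component would be candidates for closer inspection or for grouping by a clustering step.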
Supervision and teaching
The student will be supervised by Torsten Struck and José Cerca de Oliveira (NHM Oslo) as well as Patrick Kück (ZFMK, Bonn, Germany). The student will receive broad training in different aspects of biology, including, among others, bioinformatics, phylogenetic reconstruction, statistics, comparative genomics, and metazoan evolution and systematics. All of these are state-of-the-art techniques that are relevant for both academic and non-academic positions.
For further inquiries feel free to contact Torsten Struck.
Torsten Hugo Struck - t.h.struck@nhm.uio.no
José Cerca de Oliveira - josece@nhm.uio.no