Developing a deep learning method for phylogenomics

This project developed and tested a deep learning algorithm for phylogenomics, focusing particularly on programming, implementing and validating the approach.

Current statistical methods have reached breaking point: they cannot scale adequately to the enormous quantities of data that high-throughput genome sequencing can now generate. Researchers have therefore had to choose between an overly simple analysis of all the data and a more biologically realistic model applied to a much smaller dataset.

Phylogenomics, the inference of trees from genome data, is a critical first step in many modern analyses of biological data; for example, trees are used to track the spread of antibacterial resistance in hospitals. In the medium term, this new approach should make the rapid analysis of very large datasets computationally tractable.

A machine learning pipeline for phylogenomics

We use several quick-to-compute statistics to compare the input gene sequences and summarise the distances between them. These statistics are then fed into a neural network that has been trained to map them to an evolutionary tree relating the set of species (or genomes) being studied. The pipeline is illustrated in the accompanying diagram.
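
As a rough illustration of the first stage only, the sketch below computes pairwise distances from a small aligned set of sequences and flattens them into a feature vector of the kind a neural network could take as input. The choice of statistics (the proportion of differing sites and its Jukes-Cantor correction), the function names and the toy alignment are all assumptions made for this example; the project description does not specify which statistics are used, and the network itself is not shown.

```python
import math
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which two sequences differ.

    Sites where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance; infinite once p reaches 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_features(alignment: dict[str, str]) -> list[float]:
    """Flatten all pairwise corrected distances into one feature vector.

    Taxa are sorted by name so the feature order is reproducible.
    """
    names = sorted(alignment)
    return [
        jukes_cantor(p_distance(alignment[i], alignment[j]))
        for i, j in combinations(names, 2)
    ]


# Hypothetical toy alignment of four taxa, ten sites each.
toy_alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACGTTT",
    "D": "TCGAACGGTT",
}
features = distance_features(toy_alignment)
print(features)
```

For n taxa this yields n(n-1)/2 features, so the four-taxon example above produces six values. Statistics like these are cheap to compute compared with fitting a full likelihood model, which is the property the pipeline relies on.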

People involved in this project

Prof Mark Beaumont (Professor of Statistics)
Dr Tom Williams (Senior Research Fellow/Proleptic Senior Lecturer)