Shotgun metagenomic sequencing has become commonplace for studying microbial communities, their relationship with the health of our planet and their direct effects on our own health. Currently, there are >180,000 shotgun metagenomes publicly available, but until recently treating these data as a resource has been challenging due to their extreme size (>700 trillion base pairs).
Recently we have developed a tool that can efficiently convert this base pair information into a straightforward assessment of which microorganisms are present in a sample, and what their abundances are. Searching this dataset can now be undertaken in milliseconds.
Following on from this first step of tabulating microbial community profiles from each of these datasets, we can now treat this public resource as an enormous dataset that informs our understanding of the world’s microbiomes and their properties. This project will apply machine learning tools to predict properties of these microbial communities.
The main challenge encountered previously in applying predictive algorithms is that the properties of each community (e.g. whether it is derived from a human faecal sample, or the concentration of carbon dioxide) are sometimes missing.
The first task will be to determine which properties have sufficient associated data available, and the second will be to train machine learning algorithms to predict these properties. Given that prediction on this scale has not previously been attempted, there is much to learn, and significant progress can hopefully be made in a short amount of time.
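As a rough illustration of the first task, the sketch below counts how often each metadata property is actually filled in across a set of samples and keeps those above a coverage threshold. It is only a minimal sketch under stated assumptions: the record layout, the property names (host, body_site, co2_ppm), the function names and the 50% threshold are all hypothetical and are not taken from the project's data or tooling.

```python
from collections import Counter


def metadata_coverage(samples):
    """Return, for each property, the fraction of samples with a non-missing value."""
    counts = Counter()
    for metadata in samples:
        for key, value in metadata.items():
            # Treat None and empty strings as missing (an assumption for this sketch).
            if value not in (None, ""):
                counts[key] += 1
    total = len(samples)
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


def well_covered(samples, min_fraction=0.5):
    """List properties present in at least min_fraction of samples, sorted by name."""
    coverage = metadata_coverage(samples)
    return sorted(key for key, frac in coverage.items() if frac >= min_fraction)


# Hypothetical toy records standing in for per-metagenome metadata.
samples = [
    {"host": "human", "body_site": "gut", "co2_ppm": None},
    {"host": "human", "body_site": ""},
    {"host": None, "body_site": "soil", "co2_ppm": 410.0},
    {"host": "mouse"},
]

candidate_properties = well_covered(samples, min_fraction=0.5)
```

Properties that pass such a filter would be the natural candidates for the second task, training predictive models, although the appropriate threshold will depend on the property and the model.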
The work will be supported by the excellent computational resources available at the Centre for Microbiome Research, comprising >2,100 hyperthreaded CPU cores, >8 TB RAM and an NVIDIA V100 GPU spread across 9 nodes. Hardware and OS maintenance is carried out by QUT’s eResearch arm, and CMR employs a system administrator for technical software support.
You will sit at the TRI, with access to fellow PhD students and the supervisor Dr Woodcroft, and be given access to the CMR compute cluster.
Expected outcomes include developing:
- an understanding of the metadata landscape for public shotgun metagenomes
- machine learning algorithms to predict sample characteristics from their microbiomes.
Skills and experience
Some knowledge of one or more programming languages (e.g. Python or R) is required. An understanding of basic biology or machine learning would also be advantageous.
Contact the supervisor for more information.