Overview

Topic status: We're looking for students to study this topic.

Bioinformatics is generally concerned with the analysis of genetic sequences and the prediction of their function or of the regulatory systems that control their action. Genomics is a maturing field and substantial progress has been made in elucidating the mechanisms which underpin cellular processes. However, a number of developments in sequencing technologies have led to an exponential increase in the number of sequenced genomes, and this explosion in data availability has changed the character of the studies being undertaken. Bioinformatics is increasingly concerned with the comparative analysis of multiple genomes, and with the use of data and confirmed relationships from other organisms in the investigation of data which is being explored for the first time.

Nevertheless, the problems remain similar, and we may use machine learning algorithms trained on known exemplars to predict functional subsequences. These techniques may be based on probabilistic or discriminative (classification-based) approaches, but the problems are inherently difficult due to the high levels of noise in the data. One possible approach to overcoming the limitations of the data is to exploit text-based information from the published literature in the domain. Almost all of this material is collected into online databases such as PubMed, and this may provide additional knowledge that can improve the accuracy of current approaches. This project aims to exploit the information in existing literature and use it to bias and correct relationships predicted within the discovery process - usually by updating probabilistic expectations currently derived from sequence observation.
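As a rough illustration of what "updating probabilistic expectations" could look like, the sketch below applies the odds form of Bayes' rule to a sequence-derived probability. It is not the project's method: the function name `update_with_literature` and the idea of summarising literature evidence as a single likelihood ratio are assumptions made purely for illustration.

```python
def update_with_literature(prior: float, likelihood_ratio: float) -> float:
    """Combine a sequence-derived probability with literature evidence.

    prior            -- probability of a relationship estimated from sequence
                        data alone (hypothetical input, strictly between 0 and 1)
    likelihood_ratio -- how much more likely the observed literature evidence
                        is if the relationship holds than if it does not
                        (hypothetical summary of the text-mining output)
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood_ratio must be positive")

    # Odds form of Bayes' rule: posterior odds = prior odds * likelihood ratio.
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio

    # Convert the odds back to a probability.
    return posterior_odds / (1.0 + posterior_odds)
```

A likelihood ratio above 1 raises the sequence-based estimate, a ratio below 1 lowers it, and a ratio of exactly 1 leaves it unchanged, which is one simple way literature evidence could "bias and correct" a predicted relationship.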

Approaches: Analysis of the Medline corpus with a mixture of methods inherited from information extraction and semantic spaces will constitute the main initial approach - establishing a knowledge base capable of supporting the machine learning studies. Information extraction tools based on conditional random fields or hidden Markov models will be employed to identify the tokens of text most relevant to the domain, such as names of proteins or functions. Semantic spaces are mainly based on co-occurrence matrices and have been developed to uncover semantic relationships between key terms. While tools exist in both these areas, their use and adaptation to the biology domain will constitute key research components, as will the use of these literature models to bolster predictive methods.
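The co-occurrence idea behind semantic spaces can be sketched in a few lines. The example below is a minimal illustration only, assuming sentences have already been tokenised and that sentence-level co-occurrence with cosine similarity is used; the function names `cooccurrence` and `cosine` are hypothetical, and real semantic-space models typically add weighting and dimensionality reduction.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(sentences):
    """Count how often each pair of distinct terms shares a sentence.

    sentences -- iterable of token lists (assumed to be pre-tokenised)
    Returns a mapping term -> Counter of co-occurring terms.
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        # Each unordered pair of distinct terms counts once per sentence.
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse co-occurrence vectors."""
    shared = u.keys() & v.keys()
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Two terms that never appear together can still score as similar if they occur in similar contexts, which is how a semantic space can suggest a relationship that is not stated directly in any single sentence.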

References: Please see www.mquter.qut.edu.au/bio for a description of the problem domain and some references to earlier work in bioinformatics and visualisation of biological data sets.

Study level
Honours
Supervisors
QUT
Organisational unit

Science and Engineering Faculty

Research area

Computer Science

Contact

Please contact the supervisor.