Topic status: We're looking for students to study this topic.

The World Wide Web (the Web) has grown quickly over the last two decades and is still growing. In 2008, the Google Blog reported that the company had discovered 1 trillion web pages and had still not covered the whole Web. The Web has become one of the largest databases and one of the most significant sources of information.

However, finding specific information in this tremendous amount of data can be frustrating. Search engine technologies aim to address this crucial and challenging task so that people can exploit the Web easily. One such technique is to classify web pages into pre-defined categories, such as geographical zones, time periods, or topics.

This project aims to investigate techniques for classifying web documents according to their topics. A topic is a set of terms representing the content of a web page. The main tasks are topic extraction and classification. To simplify the process, we assume that every web page has one topic and is classified into a single category. The expected outcomes of this project include new approaches to topic-based web page classification. The results can be used for different purposes and in different domains, such as the development of focused crawlers, distributed information retrieval, or document summarisation.
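To make the two tasks concrete, the following is a minimal sketch of topic extraction followed by single-category classification. It is only an illustration under stated assumptions, not the approach the project will develop: the stopword list, the category names, the seed terms, and the function names `extract_topic`, `jaccard`, and `classify` are all hypothetical, and term frequency with Jaccard overlap stands in for whatever extraction and classification methods the research arrives at.

```python
import re
from collections import Counter

# Hypothetical stopword list and category seed terms, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with", "at"}

CATEGORIES = {
    "sport": {"match", "team", "goal", "league", "player"},
    "finance": {"market", "stock", "bank", "shares", "investor"},
}


def extract_topic(text, k=5):
    """Return the k most frequent non-stopword terms as the page's topic."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {term for term, _ in ranked[:k]}


def jaccard(a, b):
    """Jaccard similarity between two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(text, categories=CATEGORIES, k=5):
    """Assign the page to the single category whose seed terms best match its topic."""
    topic = extract_topic(text, k)
    # Exactly one category is returned, matching the one-topic-per-page assumption.
    return max(sorted(categories), key=lambda name: jaccard(topic, categories[name]))


page = "The team scored a late goal and the team won the league match."
print(classify(page))
```

In this sketch the topic is literally a set of terms, and the classifier always returns exactly one category, mirroring the simplifying assumption above. A real system would likely need better term weighting and a learned classifier rather than hand-picked seed terms.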

Study level
PhD, Masters, Honours
Organisational unit
Science and Engineering Faculty

Research areas
document classification, document clustering, document topic, web, document

For more information, please contact Dr. Jinglan Zhang.