Operational Technology (OT) is a field of computing that is becoming increasingly prominent in modern society. It underpins a variety of critical services, especially in industrial contexts, including power generation, manufacturing, and transport, among many others. This important role makes OT an especially tempting target for malicious attackers. To counter this, tools must be developed to locate vulnerabilities and flaws in OT software systems before attacks can be launched. However, vulnerability discovery in software systems, including OT systems, remains a challenging and unsolved problem.
Recently, deep learning-based models have been proposed for vulnerability discovery in software systems. One important reason for the emergence of these models is their ability to capture the semantics of source code: they can discover latent features representing the meaning of the code that human experts may never be able to define. However, existing deep learning models are mainly developed for the vulnerability discovery task itself, not for source code representation (also called code embedding). This project investigates software vulnerability discovery based on source code embeddings using deep learning.
In this project, we will evaluate existing deep learning-based vulnerability detection models and explore the effectiveness of semantics-based code embeddings for vulnerability discovery in OT networks.
Specifically, the project aims to:
- adapt the Code2Vec method to generate code embeddings that represent source code semantically
- evaluate the impact of code semantics on the accuracy of vulnerability discovery using supervised classification-based models
- develop and evaluate a semi-supervised method to identify vulnerabilities in a large unlabelled dataset based on the code embeddings learnt from a small labelled dataset.
Upon conclusion of this research project, we expect:
- to have improved models or algorithms for generating code embeddings that represent source code semantically
- to have developed a semi-supervised method to identify vulnerabilities in a large unlabelled dataset based on the code embeddings learnt from a small labelled dataset.
Skills and experience
To be considered for this project, we expect you to have:
- knowledge of data mining and machine learning
- knowledge of networking
- good programming skills (preferably in Python or C#)
Contact the supervisor for more information.