Enhancing Breast Cancer Prediction Using Unlabeled Data

Name
Khatia Kilanava
Abstract
This thesis presents a deep learning (DL) approach for the automatic classification of invasive ductal carcinoma (IDC) tissue regions in whole slide images (WSI) of breast cancer (BC) using unlabeled data. DL methods process information through multiple levels of abstraction, loosely inspired by the way the human brain works, and have been shown to outperform traditional approaches on some of the most complex problems, such as image classification and object detection. However, DL requires a large set of labeled data that is difficult to obtain, especially in the medical field, as neither hospitals nor patients are willing to reveal such sensitive information. Moreover, machine learning (ML) systems achieve better performance at the cost of becoming increasingly complex. As a result, they become less interpretable, which causes users to distrust them. Model interpretability is a way to enhance trust in a system. It is a highly desirable property, and it becomes crucial with the pervasive adoption of ML-based models in critical domains such as medicine. In medical diagnostics, predictions cannot be followed blindly, as doing so may result in harm to the patient. IDC is one of the most common and aggressive subtypes of breast cancer, accounting for nearly 80% of all cases. Assessment of the disease is a time-consuming and challenging task for pathologists, as it involves scanning large swaths of benign regions to identify areas of malignancy. At the same time, accurate delineation of IDC in WSI is crucial for grading cancer aggressiveness. In this study, a semi-supervised learning (SSL) scheme based on a deep convolutional neural network (CNN) is developed for IDC diagnosis.
The proposed framework first augments a small set of labeled data with synthetic medical images generated by a generative adversarial network (GAN). This is followed by feature extraction using a network pre-trained on a larger dataset, and a data-labeling algorithm that assigns labels to a much broader set of unlabeled data. After feeding the newly labeled set into the proposed CNN model, acceptable performance is achieved: an AUC of 0.86 and an F-measure of 0.77. Moreover, the proposed interpretability techniques produce explanations for the medical predictions and build trust in the presented CNN. The study demonstrates that it is possible to enable a better understanding of the CNN's decisions by visualizing the areas that are most important for a particular prediction and by finding the elements that drive the network's IDC and non-IDC decisions.
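The data-labeling step described above can be illustrated with a minimal sketch: extracted feature vectors from a small labeled set are used to assign confident pseudo-labels to unlabeled samples. The nearest-centroid rule, the function name, and the confidence threshold below are illustrative assumptions for exposition, not the thesis's actual algorithm.

```python
import numpy as np

def pseudo_label(features_labeled, labels, features_unlabeled, threshold=0.9):
    """Assign pseudo-labels to unlabeled feature vectors.

    Illustrative only: each unlabeled sample is assigned the class of its
    nearest class centroid in feature space, and only assignments whose
    softmax-style confidence exceeds `threshold` are kept.
    """
    classes = np.unique(labels)
    # One centroid per class, computed from the small labeled set
    centroids = np.stack(
        [features_labeled[labels == c].mean(axis=0) for c in classes]
    )
    # Euclidean distance from each unlabeled point to each centroid
    dists = np.linalg.norm(
        features_unlabeled[:, None, :] - centroids[None, :, :], axis=2
    )
    # Turn distances into normalized confidences (closer => more confident)
    scores = np.exp(-dists)
    conf = scores / scores.sum(axis=1, keepdims=True)
    best = conf.argmax(axis=1)
    keep = conf.max(axis=1) >= threshold  # discard low-confidence samples
    return classes[best[keep]], keep
```

In a full SSL pipeline of this kind, only the confidently pseudo-labeled samples would be merged into the training set for the downstream classifier, while the rest remain unlabeled.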
Graduation Thesis language
English
Graduation Thesis type
Master - Computer Science
Supervisor(s)
Prof. Sherif Aly Ahmed Sakr, Dr. Radwa El Shawi
Defence year
2019