IIIT Hyderabad Publications
Multi-class Classification of Malaria Parasite Life cycle using Single-cell Transcriptomes
Author: Swarnim Shukla
Date: 2023-06-16
Report no: IIIT/TH/2023/85
Advisor: Bhaswar Ghosh

Abstract

Malaria, which is spread by the female Anopheles mosquito, is a highly fatal disease that affects many parts of the world, with up to 0.4 million deaths reported worldwide. The detection of malaria infection levels is based on the expression of vital genes. Experts quantify malaria parasite-infected red blood cells (RBCs) and classify their life cycle stages at the macroscopic level in order to make informed decisions. Several computational approaches have recently been proposed to avoid the dimensionality problem and produce accurate predictions. Our study presents a theoretical framework to select diagnostic markers and drug targets by applying machine learning (ML) techniques to single-cell RNA sequencing (scRNA-seq) data. The main objective is to select the top-ranked genes from the scRNA-seq profiles at different stages of the Plasmodium falciparum (Pf) life cycle inside infected RBCs. We employ supervised learning algorithms coupled with feature selection algorithms to extract the genes most relevant for predicting the life cycle stages of Pf inside RBCs. The first stage of modeling improves the quality of the data (5066 features) by removing irrelevant features. Genetic Algorithm (GA) based search is widely used for feature selection and for dealing with high-dimensional datasets. The reduced subset (378 features) is then used in the second stage for high-accuracy multi-class classification. In this work, a GA-based dimensionality reduction technique is applied to single-cell transcriptomics data to obtain an optimized subset of features from a larger dataset. To separately transform the selected elements into a lower dimension, features are chosen based on their class variants, taking into account increased efficiency and accuracy.
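As an illustration of GA-based feature selection in general, the following is a minimal sketch and not the thesis's implementation. It assumes a synthetic two-class dataset in place of the scRNA-seq matrix, a nearest-centroid classifier as the fitness evaluator, and a small per-feature penalty to favor compact subsets. All function names, parameters, and data here are hypothetical, and fitness is computed on the training data for brevity (a real study would use held-out data).

```python
import random

def make_toy_data(rng, n_per_class=30, n_features=20, informative=(0, 1, 2)):
    """Synthetic stand-in for an expression matrix: only `informative` columns differ by class."""
    X, y = [], []
    for label in range(2):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for j in informative:
                row[j] += 3.0 * label  # shift informative features for class 1
            X.append(row)
            y.append(label)
    return X, y

def centroid_accuracy(X, y, mask):
    """Training accuracy of a nearest-centroid classifier restricted to the masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    labels = sorted(set(y))
    centroids = {}
    for c in labels:
        rows = [x for x, t in zip(X, y) if t == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, t in zip(X, y):
        pred = min(labels, key=lambda c: sum((x[j] - m) ** 2 for j, m in zip(cols, centroids[c])))
        correct += pred == t
    return correct / len(y)

def fitness(X, y, mask, penalty=0.01):
    """Accuracy minus a small cost per selected feature (assumed trade-off)."""
    return centroid_accuracy(X, y, mask) - penalty * sum(mask)

def ga_select(X, y, rng, pop_size=30, generations=40, mutation_rate=0.05):
    """Evolve boolean feature masks with elitism, one-point crossover, and bit-flip mutation."""
    n = len(X[0])
    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda m: fitness(X, y, m), reverse=True)
        elite = scored[: pop_size // 2]  # the best half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda m: fitness(X, y, m))

rng = random.Random(0)
X, y = make_toy_data(rng)
best = ga_select(X, y, rng)
print(sorted(j for j, keep in enumerate(best) if keep))
```

The mask encoding (one boolean per gene) is what lets a GA search over subsets directly. The penalty term is one simple way to push the search toward a reduced subset such as the 378 features reported here.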
We constructed the protein-protein interaction network (PPIN) of these genes and performed topological analysis using the Search Tool for the Retrieval of Interacting Genes/Proteins database (STRING 11.0b) and Gephi software to rank the genes by their importance in the network. Various topological measures are estimated to evaluate the node characteristics in the PPINs, including degree, betweenness centrality, eccentricity, closeness centrality, eigenvector centrality, and clustering coefficient. Proteins with high degree and betweenness centrality tend to exert more control over the network function. We also performed gene ontology analysis to determine the role of proteins in the parasite's life cycle progression. For the multi-class classification of the life cycle of the malaria parasite based on oriented gradients and local binary pattern features, a three-pronged approach employing multi-class Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF) techniques is used. With these 378 features, RF performed best with a classification accuracy of 92%, while SVM reached 91% and LR 88%. Using only the 378 features, we achieved similar or better performance scores for all four classes, across all three models. Further, randomly chosen features from our dataset of 378 were also evaluated using the SVM, LR, and RF models, yielding accuracies of 81%, 79%, and 80%, respectively. This demonstrates the robustness of the features selected using the GA-based approach. The proposed methodology can likely be used to improve malaria diagnosis and to identify drug targets.

Full thesis: pdf
Centre for Computational Natural Sciences and Bioinformatics
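Several of the node-level topological measures named in the abstract can be sketched on a toy graph, as below. This is only an illustration under stated assumptions: the edge list and node names are hypothetical placeholders rather than real Pf proteins, the graph is assumed connected and undirected, and the code is not Gephi's implementation. Betweenness and eigenvector centrality are omitted for brevity.

```python
from collections import deque

# Hypothetical toy interaction network; node names are placeholders.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree(adj):
    """Number of neighbours per node."""
    return {u: len(nbrs) for u, nbrs in adj.items()}

def closeness(adj):
    """(n - 1) / sum of shortest-path distances; assumes a connected graph."""
    return {u: (len(adj) - 1) / sum(bfs_distances(adj, u).values()) for u in adj}

def eccentricity(adj):
    """Largest shortest-path distance from each node to any other node."""
    return {u: max(bfs_distances(adj, u).values()) for u in adj}

def clustering(adj):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    out = {}
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            out[u] = 0.0
            continue
        links = sum(1 for v in nbrs for w in nbrs if v < w and w in adj[v])
        out[u] = 2.0 * links / (k * (k - 1))
    return out

adj = build_adjacency(EDGES)
deg, clo, ecc, clu = degree(adj), closeness(adj), eccentricity(adj), clustering(adj)
for node in sorted(adj):
    print(node, deg[node], round(clo[node], 3), ecc[node], round(clu[node], 3))
```

In this toy graph, node C has both the highest degree and the highest closeness, which is the kind of hub-like profile the abstract associates with greater control over network function.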