ABSTRACT
This paper proposes an efficient and accurate method to predict coronavirus disease 2019 (COVID-19) based on the genome similarity between the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and a bat SARS-CoV-like coronavirus. We introduce similarity features that distinguish COVID-19 from other human coronaviruses by comparing each human coronavirus with a bat SARS-CoV-like coronavirus. In the proposed method, each human coronavirus sequence is assigned three similarity scores that account for nucleotide similarities and for mutations that lead to a marked depletion of cytosine and guanine nucleotides. Next, the proposed features are integrated with CpG island features of the genome sequences to improve COVID-19 prediction. Thus, each genome sequence is represented by five real numbers. We demonstrate the effectiveness of the proposed features using six machine learning classifiers on a dataset of genome sequences of human coronaviruses similar to SARS-CoV-2. The classifiers perform comparably, and the k-nearest neighbor classifier with similarity features alone achieves the best results, with an accuracy of 99.2%. With CpG-based and similarity features combined, the k-nearest neighbor classifier achieves an accuracy of 99.8%. Experimental results demonstrate that the similarity features markedly decrease the number of false negatives and significantly improve overall performance. The superiority of the proposed method is also shown by comparison with state-of-the-art studies that detect COVID-19 from genome sequences.
Source: Computers & Industrial Engineering
Volume 161, November 2021, 107666
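
The general idea described in the abstract, representing each genome by a few real-valued features and classifying it with k-nearest neighbors, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract does not define its three similarity scores or its CpG island features, so the functions below (`identity`, `gc_content`, `cpg_obs_exp`, `features`, `knn_predict`) are hypothetical stand-ins. The sketch assumes the query and reference sequences are already aligned to equal length, and it uses the common observed/expected CpG ratio as an example of a CpG-based feature.

```python
import math
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of C and G nucleotides in the sequence."""
    seq = seq.upper()
    return (seq.count("C") + seq.count("G")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def identity(seq: str, ref: str) -> float:
    """Fraction of matching positions (assumes pre-aligned, equal length)."""
    if len(seq) != len(ref):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq.upper(), ref.upper()))
    return matches / len(ref)


def features(seq: str, ref: str) -> list[float]:
    """Hypothetical feature vector; NOT the five features of the paper."""
    return [identity(seq, ref), gc_content(seq), cpg_obs_exp(seq)]


def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to the query.

    `train` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy sequences for illustration only; real genomes are ~30,000 bases.
    reference = "ATGCGTACGTTAGC"
    labelled = [
        ("ATGCGTACGTTAGC", "similar"),
        ("ATGCGTACGTTAGA", "similar"),
        ("TTTTATATATTTAA", "different"),
        ("TTATATTTATATAA", "different"),
    ]
    train = [(features(s, reference), y) for s, y in labelled]
    print(knn_predict(train, features("ATGCGTACGATAGC", reference), k=3))
```

In practice the features would be computed on full-length genomes, and a library classifier with cross-validation would replace the hand-written `knn_predict`.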