Explainable AI in Medical Diagnoses using Weakly Supervised Segmentation and sub-network training

Professor Sivaswamy's Lab, CVIT, IIIT Hyderabad

While adoption of AI into a plethora of fields can seem promising and easy going, it's not the case in medical domain due to the black box approach the deep learning approaches employ. This work is aimed at bridging the trust gap between AI based diagnoses and clinician bases diagnoses. Using novel training methodology and architecture, we propose to provide explanations to the decisions. We propose a single architecture which can deal with a data set which is fully labeled but only a fraction of those images have local annotations.

Feature Engineering using Protein/Peptide sequences

Computing wide range of protein/peptide features from their sequence and structure
Akshara Pande, Sumeet Patiyal, Anjali Lathwal, Chakit Arora, Dilraj Kaur, Anjali Dhall, Gaurav Mishra, Harpreet Kaur, Neelam Sharma, Shipra Jain, et al.
bioRxiv Preprint, April 2019

pdf, webserver, code

In order to facilitate scientific community, a software package that computes more than 50,000 features has been developed, important for predicting function of a protein and its residues. It has five major modules for computing; composition-based features, binary profiles, evolutionary information, structure-based features and patterns. The composition-based module allows user to compute the following(non-exhaustive):

  • Miscellaneous compositions like pseudo amino acid, autocorrelation, conjoint triad, quasi-sequence order.

  • Shannon entropy to measure the low complexity regions;

  • Repeats and distribution of amino acids

  • simple compositions like amino acid, dipeptide, tripeptide;

  • Properties based compositions;

Binary profile of amino acid sequences provides complete information including order of residues or type of residues; specifically, suitable to predict function of a protein at residue level. Pfeature allows one to compute evolutionary information-based features in form of PSSM profile generated using PSIBLAST. Structure based module allows computing structure-based features, specifically suitable to annotate chemically modified peptides/proteins.In last three decades, a wide range of protein descriptors/features have been discovered to annotate a protein with high precision. A wide range of features have been integrated in numerous software packages (e.g., PROFEAT, PyBioMed, iFeature, protr, Rcpi, propy) to predict function of a protein. These features are not suitable to predict function of a protein at residue level such as prediction of ligand binding residues, DNA interacting residues, post translational modification etc.

Protein-Ligand Interaction

Sambinder: A web server for predicting sam binding residues of a protein from its amino acid sequence
Piyush Agrawal, Gaurav Mishra, and Gajendra PS Raghava
bioRxiv Preprint, May 2019

pdf, webserver, dataset, code

One of the primary hurdles in today's world of science is the annotation of a protein at structural as well as functional level. Due to the swift headway in sequencing innovations, the number of protein sequences are increasing at an exponential rate in the respective databases, but they lack annotation, and this gap is increasing every moment. Therefore, there is a pressing necessity for the development of computational methods, which can determine the function of the proteins at the residue level. The interaction between the proteins and their ligands is crucial for the well‐being of an organism. In the last few decades, substantial efforts have been made toward the identification of the ligand‐binding residues in a protein as shown in the review by Sousa et al. At first, nonspecific methods were developed to predict the binding sites or pockets in the proteins, irrespective of their ligands but soon it was realized that each ligand possesses different physical and chemical properties. Therefore, new computational methods specific to the ligands came into the picture,6-10 which performed better as compared to nonspecific methods.

S-adenosyl-L-methionine (SAM) is one of the important cofactor present in the biological system and play a key role in many diseases. There is a need to develop a method for predicting SAM binding sites in a protein for designing drugs against SAM associated disease. Best of our knowledge, there is no method that can predict the binding site of SAM in a given protein sequence. This manuscript describes a method SAMbinder, developed for predicting SAM binding sites in a protein from its primary sequence. All models were trained, tested and evaluated on 145 SAM binding protein chains where no two chains have more than 40% sequence similarity. Firstly, models were developed using different machine learning techniques on a balanced dataset contain 2188 SAM interacting and an equal number of non-interacting residues. Our Random Forest based model developed using binary profile feature got maximum MCC 0.42 with AUROC 0.79 on the validation dataset. The performance of our models improved significantly from MCC 0.42 to 0.61, when evolutionary information in the form of PSSM profile is used as a feature. We also developed models on realistic dataset contains 2188 SAM interacting and 40029 non-interacting residues and got maximum MCC 0.61 with AUROC of 0.89. In order to evaluate the performance of our models, we used internal as well as external cross-validation technique.

NAGbinder: An approach for identifying n-acetylglucosamine interacting residues of a protein from its primary sequence
Sumeet Patiyal, Piyush Agrawal, Vinod Kumar, Anjali Dhall, Rajesh Kumar, Gaurav Mishra, Gajendra PS Raghava
Protein Science, 2019

pdf, webserver, dataset, code

In this study, we made a methodical attempt to predict the NAG interacting residues in the given protein sequence. We believe this study will be advantageous to the researchers working in the field of drug discovery. In order to facilitate the scientific community, a web server and standalone software has been developed for predicting the NAG interacting residues in a protein.

The monosaccharide N‐acetylglucosamine (NAG) is ubiquitous in the environment. It is known for playing essential structural roles at the cell surface ranging from bacteria to human. It is the principal component of the bacterial cell wall peptidoglycan, and of chitin in the fungal cell wall. Glycosaminoglycans are also present on the extracellular matrix in animal cells. It is also involved in processes such as cell signaling in fungi and bacteria, and regulation of gene expressions. Plants and animal cells also use NAG for cell signaling and act as the sensors for the status of the nutrition that lead to the modification of the protein by the attachment of O‐GlcNAc. In recent years, NAG is suggested as a treatment for autoimmune disorders. In the case of human, NAG signaling facilitates the coexistence of the extensive range of bacteria, fungi, and human cells in the gut.

In-Silico Drug Discovery using Protein-Small Molecule Interaction
Gaurav Mishra, Gajendrs PS Raghava
Shiv Nadar University Library, May 2019


Proteins are the fundamental players of all living cells and play a vital role in various cellular functions. Protein function is primarily determined by its struc- ture. Interaction of protein with other molecules for example, ligand or other small molecules imparts its specific function. Protein-small molecule interaction is one such interaction which has been studied in detail in past in the field of drug designing. These small molecules carry out various processes such as sig- nal transmission, post-translational modification, etc. therefore it is important to understand the protein-small molecule interaction in order to design novel drugs. The project aims at creating an in-silico model drug discovery model by predict- ing if a particular amino acid in a given protein would bind to a ligand. We have chosen the ligand as Uridine-5’-diphosphate. The ligand imparts many important functionalities to the proteins it interacts with. Various machine learning algo- rithms have been used to generate predictive models which successfully do the required task. In this study, we will make an attempt to annotate protein at residue level. We will select biologically important ligand to find out interaction with given protein sequences by implementing machine learning techniques including. The application of this work will be in drug designing where we will target the proteins which are involved in various diseases like cancer, tuberculosis, etc.

Ensemble Learning in Regression Applications

Gaurav Mishra, Karan Praharaj, Madan Gopal

This project aims to assess the performance and usability of ensemble regression techniques against the performance of other machine learning algorithms by comparing the results using test mean squared error as the metric for comparison. The goal of ensemble regression is to combine several models in order to improve the prediction accuracy on learning problems with a numerical target variable. Many methods for constructing ensembles have been developed. The method used in this project is constructing ensembles that manipulate the training examples to generate multiple hypotheses. For experimental evaluation, we have used a house value prediction problem and a stock index prediction problem. Specifically, the bagging, boosting and random forests ensemble methods were used in addition to the standard machine learning. In terms of test accuracy, it was found that Boosting (AdaBoost) and Bagging perform the best of all the algorithms for the housing problem. We expect that the learning attained from these results will provide a useful platform for our further work on the Stock Index Prediction problem.

alt text