Work Packages CeProVI

Computational and Experimental Protein Variant Interpretation

Project Acronym:	CeProVI
Project Code:	MIUR-PRIN-201744NR8S
Project Title:	Integrative tools for defining the molecular basis of the diseases: Computational and Experimental methods for Protein Variant Interpretation.

WP1: Development of an integrated protein variation database (Months 1-36)

This WP is one of the pillars of our project. All the units will collaborate on the database implementation, starting from public resources and enriching the database with information derived from new experiments and computational annotations. The main aim of WP1 is the collection of high-quality data on single amino acid variants (SAVs), including their impact on protein structure, stability and protein-protein binding affinity, and their association with diseases. This will be a significant step towards an integrated resource that incorporates thermodynamic data and the structural and functional effects of SAVs in a single framework. This WP includes the development of a semi-automatic tool for extracting experimental data from the literature and updating the database. To perform this task we will implement a semi-automatic method based on Natural Language Processing (NLP) techniques. The algorithm will scan each manuscript abstract in PubMed and return a score related to the probability of finding experimental measures of protein stability in the full-text version of the paper. We will start by aggregating heterogeneous data from publicly available repositories (such as Protein Data Bank (www.wwpdb.org/), ProTherm (www.abren.net/protherm/), SKEMPI (life.bsc.es/pid/mutation_database/), PMD (pmd.ddbj.nig.ac.jp/~pmd/pmd.html), ClinVar (www.ncbi.nlm.nih.gov/clinvar/), SwissVar (swissvar.expasy.org/) and OMIM (www.omim.org)) into a common database. As a first step, we will filter the ProTherm database by removing all detected imprecisions and mistakes, starting from a cleaned list of variants recently released on VariBench (http://structure.bmc.lu.se/PONTstab/). As a second step, we will include new experimental data for variant sets of P53 [30] and myoglobin [31]. Finally, we will aggregate all information from PMD, SKEMPI, ClinVar and SwissVar.
During this process, we will generate a preliminary version of the database that will be used to test the accuracy of a new algorithm for predicting the effect of mutations on protein stability (WP2). Finally, we will extend the database with new experimentally derived information by selecting specific proteins with variants reported in the COSMIC and ClinVar databases that satisfy the following criteria: i) the experimental 3D structure is available and deposited in the PDB repository and ii) the proteins and their variants can be expressed in E. coli. The final database will be released as a web-based public resource and will be freely accessible.
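
The abstract-scoring step described above could look, in its simplest form, like the sketch below. It is only an illustration and not the project's NLP method: the keyword list, the weights, the bias and the function name score_abstract are all hypothetical, and a real tool would learn its parameters from annotated abstracts. The sketch sums hand-picked keyword weights and maps the total to a value between 0 and 1 with a logistic function.

```python
import math
import re

# Hypothetical keyword weights, chosen by hand for illustration only.
KEYWORD_WEIGHTS = {
    "stability": 1.0,
    "unfolding": 1.5,
    "denaturation": 1.5,
    "calorimetry": 1.5,
    "circular dichroism": 1.0,
    "melting temperature": 1.5,
    "free energy": 1.0,
    "mutant": 0.5,
    "variant": 0.5,
}
BIAS = -3.0  # hypothetical offset: abstracts without any keyword score low


def score_abstract(text: str) -> float:
    """Return a value in (0, 1); higher means more stability-related terms."""
    lowered = text.lower()
    total = BIAS
    for phrase, weight in KEYWORD_WEIGHTS.items():
        # Match at a word start so that plurals such as "mutants" are counted.
        hits = len(re.findall(r"\b" + re.escape(phrase), lowered))
        total += weight * min(hits, 3)  # cap the effect of repeated mentions
    return 1.0 / (1.0 + math.exp(-total))
```

Abstracts could then be ranked by this score so that the highest-scoring papers are inspected manually first, which is what makes the procedure semi-automatic.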

WP2: Development of the Predictors (Months 5-30)

We will develop a new computational tool for predicting the impact of protein variants by integrating our state-of-the-art predictors (PhD-SNPg, SNPs&GO, I-Mutant and INPS). We will set up a dedicated Information Technology (IT) infrastructure that will host the software and the data for the project. In this phase, UNIT1 and UNIT2 will share the software and agree on the datasets to use for software integration and testing. Once the dataset is complete, we will generate non-redundant cross-validation sets for properly training and testing the new methods. The next most important step is the generation of multiple sequence alignments of homologues for each protein in our database. The evolutionary data within the multiple sequence alignments can be exploited in various ways to predict the effect of a variation on thermal stability. The Position-Specific Substitution Matrix (PSSM) of each protein, obtained from the multiple alignment, is readily useful for the derivation of hidden Markov models. Each multiple alignment can also be used within ConSurf to estimate the relative evolutionary rate of each amino acid position. The PSSM, evolutionary rates and other features will be used to train a machine-learning-based method, and also an ensemble of methods. We will address the problem of predicting the DDG of both single and multiple variations. To cope with the different numbers of variations that can contribute to the final DDG value, we will use a multi-instance learning approach based on msi-Support Vector Machines, msi-NNs, 1D-convolutional NNs and recurrent NNs. We will also develop a new predictor of binding affinity changes using the SKEMPI data. We will introduce statistical potentials modulated by evolutionary information, together with new descriptors derived from physics-based computational methods (UNIT5). We will test random forests, SVMs, gradient boosting and bagging of linear regressors.
The best method, or an ensemble of all of them, will be our final tool. The main novelty of the new method will consist in the integration of information from different variation databases, which will make it possible to capture complementary aspects of the relationship between protein stability, function and disease. These results will be obtained through the concerted work of UNIT1 and UNIT2, which will share the starting software and focus on complementary aspects of the problem.
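
As an illustration of how a PSSM can be derived from a multiple sequence alignment, the following minimal sketch computes per-column log-odds scores. It rests on simplifying assumptions that are not taken from the project: a uniform background frequency, a flat pseudocount, no sequence weighting, and a hypothetical function name pssm_from_alignment. Profiles used in practice normally rely on sequence weighting and empirical background frequencies.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_from_alignment(alignment, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    if not alignment:
        raise ValueError("alignment is empty")
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive to avoid log(0)")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    matrix = []
    for col in range(length):
        # Start every residue at the pseudocount, then add observed counts.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in alignment:
            residue = seq[col].upper()
            if residue in counts:  # gaps and unknown symbols are skipped
                counts[residue] += 1.0
        total = sum(counts.values())
        matrix.append(
            {aa: math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS}
        )
    return matrix
```

For a variant at a given position, the scores of the wild-type and substituted residues in the corresponding column can then be used as input features for the predictors.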

WP3: Generation of new experimental data: structural, functional and stability (Months 1-32)

This WP focuses on the generation of fundamental experimental information. UNIT3 and UNIT4 will express and biophysically characterize wild-type and natural protein variants carrying missense mutations related to pathological states in humans, with a main focus on the somatic nonsynonymous single nucleotide variants (nsSNVs) observed in cancer cells (UNIT4) and in Ca2+-signalling related genetic diseases (UNIT3). Proteins will be heterologously expressed in E. coli and purified. Established experimental techniques will be used to measure changes in stability with respect to the wild type, including near- and far-UV circular dichroism (CD) (UNIT3 and UNIT4), differential scanning calorimetry (UNIT3), fluorescence spectroscopy (UNIT3 and UNIT4), FTIR (UNIT4) and Nuclear Magnetic Resonance (1H-15N HSQC experiments, UNIT3). UNIT3 will focus mostly on 11 point mutations in calmodulin (CaM) (see Fig 1 panel A) found in genetic diseases in which Ca2+ signaling is perturbed, namely catecholaminergic polymorphic ventricular tachycardia (CPVT) and long QT syndrome (LQTS), leading to cardiac arrest. It will also characterize some of the 16 CaM SAVs associated with cancer. UNIT4 will focus on SAVs found in cancer tissues in the protein kinases MAPK1, 3, 6, 8 and 11 and the phosphatases PTPN4, 11 and 14, which regulate the balance between protein phosphorylation and dephosphorylation and control a number of biological processes. Perturbation of this balance, as a consequence of alterations in the structure and/or activity of protein kinases or phosphatases, may result in abnormal cellular processes such as uncontrolled proliferation and tumorigenesis. UNIT4 will express a selected sample of nsSNVs found in pathological tissues and reported in databases (COSMIC, ClinVar, OMIM) and will measure the DG and DDG associated with each variant. The structural properties, thermal and thermodynamic stability, and function of the nsSNV variants will be compared with those of the corresponding wild-type proteins.
To gain insight into changes in the three-dimensional structure, as well as into local interactions of the mutated side chains, UNIT4 will provide the selected variants to UNIT3 and UNIT5, which will perform structural analysis by NMR and molecular dynamics, respectively. The structural information will help us understand how local changes induced by mutations influence the mobility and dynamics of the variant. The main novelty of this study will consist in the collection of experimental protein structure and stability data for sets of variants associated with cancer and genetic defects in calcium signaling.

WP4: Generation of new experimental data of binding affinity variations (Months 5-32)

The effects of nsSNVs on the interaction between the affected protein and its biological target will be investigated in this WP to estimate the relative binding affinity (DDG = DGmut - DGwt) and provide useful experimental data for WPs 1, 2 and 5. UNIT3, with essential contributions from UNIT4 and UNIT5, will be mainly involved in this WP. As a test system, we will study the interaction of wild-type and mutated CaM with peptides from the ryanodine receptor 1 (RyR1), a CaM target found in skeletal muscle cells, and from the myocardium-specific RyR2 target, in order to highlight tissue-specific effects of the observed nsSNVs. Some of the 16 cancer-related somatic missense mutations identified in CALM1, which codes for CaM, will also be experimentally and computationally analyzed, giving priority to those that most likely affect the interface with binding partners (see WP4). Surface Plasmon Resonance (SPR), ITC and NMR will be used to measure DDG values for the proteins and their respective biological targets. In SPR experiments, the protein of interest will be immobilized on a suitable sensor chip by Ni-NTA specific affinity binding, while the binding partner is flowed over the chip at different suitable concentrations; the kinetic rates of association and dissociation are measured in real time from the binding-induced changes in the refractive index at the metal/dielectric interface. By measuring the maximal response at specific ligand concentrations, the apparent Kd can be assessed, from which DG is directly derived (DG = RT ln Kd). Immobilization of both the wild-type and the variant protein in different flow cells will allow direct and indirect estimates of DDG values. The impact of the single amino acid substitution on protein dynamics and interactions will be studied by non-equilibrium molecular dynamics (Jarzynski method). The relative affinity will also be determined by ITC, measuring the binding constants and the binding enthalpy at a fixed temperature.
As a link between calcium signaling, genetic diseases and cancer, we will focus on the possible interaction of CaM with other proteins of the PI3Kalpha/Akt pathways, which are important in the onset and progression of cancer. The major goal of this WP is to measure the binding affinity of the wild-type and variant proteins involved in the pathogenic processes described in WP3 that have not yet been characterized.
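
The two relations used in this WP, DG = RT ln Kd and DDG = DGmut - DGwt, can be written out as a short worked example. The sketch below assumes that Kd is a dissociation constant in mol/L (relative to a 1 M standard state), that energies are reported in kcal/mol, and that the temperature is 298.15 K unless stated otherwise. The function names and the Kd values in the usage note are hypothetical and are not measured data.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """DG = RT ln Kd, with Kd in mol/L; tighter binding gives a more negative DG."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def ddg_binding(kd_mut, kd_wt, temperature_k=298.15):
    """DDG = DGmut - DGwt; a positive value means the variant binds more weakly."""
    return (binding_free_energy(kd_mut, temperature_k)
            - binding_free_energy(kd_wt, temperature_k))
```

For instance, a hypothetical variant whose Kd rises from 1 nM to 10 nM gives ddg_binding(1e-8, 1e-9) of about 1.36 kcal/mol, that is RT ln 10 at 298.15 K.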

WP5: Data validation and evaluation of the predictors (Months 25-36)

The association of structural and functional markers with diseases is performed according to a well-established knowledge-based strategy. This approach is often combined with physics-based ones with the aim of strengthening the results and possibly building a rationale for the statistical association. This WP aims at integrating the efforts of WP2, WP3 and WP4 by testing the developed computational methods on the newly generated data. The discrepancy between computation and experiments will be used to fine-tune the predictors. During this WP we will explore the possibility of generating “ad hoc” protein-based predictors to assess all possible variations at the level of a single protein. The first major outcome of this WP is the definition of reliability thresholds for the machine-learning predictors that allow us to infer the impact of the variations. The second relevant product of this WP will be the definition of a pipeline for the annotation of new variations. The pipeline will start by processing the protein sequence and generating comparative and evolutionary information, which will be used to evaluate the impact of the variations on protein stability, binding affinity and relevant local changes, also at a quantum mechanical level. In this WP, we will significantly contribute to the collection of an independent dataset for assessing the performance of methods for predicting the impact of variants on protein stability, binding affinity and disease. This dataset will be released for the Critical Assessment of Genome Interpretation (CAGI) experiments, designed to evaluate the quality of blind predictions submitted by several methods and to define the limitations of the state-of-the-art tools.
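
The discrepancy between predicted and experimental DDG values could be quantified, for example, with the root-mean-square error and the Pearson correlation coefficient. The sketch below is a minimal illustration of these two standard measures. The choice of metrics and the function names are assumptions and not a description of the project's evaluation protocol.

```python
import math


def rmse(predicted, experimental):
    """Root-mean-square error between two equally long lists of values."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(predicted, experimental):
    """Pearson correlation coefficient between predictions and measurements."""
    if len(predicted) != len(experimental) or len(predicted) < 2:
        raise ValueError("inputs must have equal length and at least two values")
    mean_p = sum(predicted) / len(predicted)
    mean_e = sum(experimental) / len(experimental)
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, experimental))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    if var_p == 0 or var_e == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_e)
```

A low RMSE together with a high correlation on the newly measured variants would indicate that a predictor generalizes beyond its training data, whereas a high correlation with a large RMSE would point to a systematic offset or scaling error.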