About SUBA4
What is SUBA?
SUBA is a tool for investigating subcellular localisation in Arabidopsis. It unifies disparate datasets and offers web services through an accessible interface. Users can construct powerful queries or interrogate their own protein sets, making SUBA a one-stop shop for protein localisation and protein location relationships in the Arabidopsis model plant. SUBA houses large-scale proteomic, GFP localisation and protein-protein interaction (PPI) data, as well as PPI localisation data from subcellular compartments of Arabidopsis. SUBA4 also contains precompiled bioinformatic predictions of protein subcellular localisation and a consensus call that takes predictive and experimental information into account. The SUBA4 search interface and SUBA4 toolbox provide flexible options for refining or interrogating protein data sets by location, expected abundance, interactions, coexpression, protein properties and bibliographic information.
Why SUBA?
Subcellular localisation information can contribute towards our understanding of protein function, protein redundancy and of biological relationships. While a variety of technologies are currently employed to determine the subcellular location of proteins, much of this information is not available in an integrated manner. In an attempt to get a clearer picture of our experimental data and to more generally understand subcellular partitioning we have brought together and expanded various data sources to build SUBA. The database has a web accessible interface that allows advanced combinatorial queries on the data as well as downloads for downstream applications.
The resources in SUBA4
SUBA4 experimental data
SUBA4 is updated at least once a year, and the number of experimental data sets changes with every update. The current version of SUBA is built on the TAIR10 Arabidopsis proteome and has been described in detail in the current database issue of Nucleic Acids Res. An overview of the data in SUBA4 as of June 2016 is shown below.
SUBA4 predictors
SUBA4 contains 22 predictors, which use distinct training data sets, input variables and prediction methods. These have been reviewed and compared for their contribution to the SUBA consensus call in our recent study about SUBAcon. Predictors vary in their accuracy for each subcellular compartment. Using this table you can find the most useful single predictor for your chosen compartment. The classifier SUBAcon achieves the highest accuracy across all 10 subcellular categories.
SUBA4 tool box
The SUBA4 tool box is an interactive analysis centre that contains the Multiple Marker Abundance Profiling (MMAP) tool, the PPI Adjacency Tool (PAT) and the Coexpression Adjacency Tool (CAT). Linking the SUBAcon data to protein abundance or to protein-protein relationships enables users to interpret their data spatially. The relative abundance and purity of protein samples can be estimated, PPI data sets can be refined, and spatial co-expression networks and more can be extracted by simply entering a set of AGIs and clicking a button.
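As a rough illustration of what refining a PPI data set by location can mean, the sketch below keeps only those interaction pairs whose two partners share at least one subcellular location. It is not how PAT is implemented. The function name `colocalised_pairs`, the data layout and the identifiers (which use a non-existent chromosome 0 so they cannot be mistaken for real AGIs) are all assumptions made for this example.

```python
def colocalised_pairs(pairs, locations):
    """Keep interaction pairs whose partners share at least one location.

    pairs     -- iterable of (id_a, id_b) tuples
    locations -- dict mapping an identifier to a set of location names
    Returns a list of (id_a, id_b, sorted shared locations).
    """
    kept = []
    for a, b in pairs:
        # Proteins without a known location are treated as having none.
        shared = locations.get(a, set()) & locations.get(b, set())
        if shared:
            kept.append((a, b, sorted(shared)))
    return kept


# Made-up identifiers and locations, for illustration only.
example_locations = {
    "AT0G00001": {"plastid"},
    "AT0G00002": {"plastid", "cytosol"},
    "AT0G00003": {"nucleus"},
}
example_pairs = [
    ("AT0G00001", "AT0G00002"),  # both plastid: kept
    ("AT0G00001", "AT0G00003"),  # no shared location: dropped
    ("AT0G00002", "AT0G00004"),  # partner has no location: dropped
]

refined = colocalised_pairs(example_pairs, example_locations)
print(refined)
```

In practice the location sets would come from a SUBA download rather than a hand-written dictionary.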
SUBA consensus (SUBAcon) locations
Abundant experimental data from fluorescent protein (FP) tagging or mass spectrometry (MS) are available for Arabidopsis, yet they cover only ~30% of the proteome. For the remaining 70% of proteins, many computational tools have been developed to predict proteome-wide subcellular location. None of these approaches is error-free, and their results are often contradictory. To help unify the multiple data types contained in SUBA4, we have developed the SUBcellular Arabidopsis consensus (SUBAcon) algorithm, a naive Bayes classifier that integrates 22 computational prediction algorithms, experimental GFP and MS localisations, protein-protein interaction and co-expression data to derive a consensus call and probability. SUBAcon classifies protein location in Arabidopsis more accurately than single predictors. SUBAcon is a useful tool for recovering proteome-wide subcellular locations of Arabidopsis proteins. More info about SUBAcon
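The general idea of a naive Bayes consensus can be sketched in a few lines. The example below is a toy version and not the SUBAcon implementation. It assumes three hypothetical evidence sources with invented accuracies, a uniform prior over three locations, and an error model in which a wrong call is equally likely to land on any other location. Each source's call is treated as independent given the true location, and the per-source likelihoods are multiplied (summed in log space) to give a posterior probability for every location.

```python
import math

LOCATIONS = ["plastid", "mitochondrion", "nucleus"]


def uniform_error_model(accuracy, locations):
    """P(called | true): correct with `accuracy`, errors spread evenly."""
    wrong = (1.0 - accuracy) / (len(locations) - 1)
    return {
        true: {called: (accuracy if called == true else wrong)
               for called in locations}
        for true in locations
    }


def naive_bayes_consensus(calls, likelihoods, priors):
    """Combine independent calls into a posterior over locations.

    calls       -- dict: source name -> location that source called
    likelihoods -- dict: source name -> {true: {called: probability}}
    priors      -- dict: location -> prior probability
    Returns (best location, posterior dict).
    """
    log_post = {}
    for loc, prior in priors.items():
        total = math.log(prior)
        for source, called in calls.items():
            total += math.log(likelihoods[source][loc][called])
        log_post[loc] = total
    # Normalise in a numerically stable way.
    peak = max(log_post.values())
    unnorm = {loc: math.exp(v - peak) for loc, v in log_post.items()}
    norm = sum(unnorm.values())
    posterior = {loc: v / norm for loc, v in unnorm.items()}
    return max(posterior, key=posterior.get), posterior


# Invented accuracies for three hypothetical sources.
likelihoods = {
    "predictor_a": uniform_error_model(0.7, LOCATIONS),
    "predictor_b": uniform_error_model(0.6, LOCATIONS),
    "ms_experiment": uniform_error_model(0.9, LOCATIONS),
}
priors = {loc: 1.0 / len(LOCATIONS) for loc in LOCATIONS}
calls = {
    "predictor_a": "plastid",
    "predictor_b": "nucleus",
    "ms_experiment": "plastid",
}

best, posterior = naive_bayes_consensus(calls, likelihoods, priors)
print(best, posterior)
```

Here two sources agree on the plastid and the more accurate experimental source outweighs the dissenting predictor, so the consensus call is the plastid with a posterior probability above 0.9. A real classifier would estimate the likelihood tables from reference data rather than assume them.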
Arabidopsis SUbcellular REference (ASURE)
ASURE is the reference data set used for training SUBAcon, and its construction is described in our recent study about SUBAcon. ASURE contains 5,393 proteins, of which 2,894 (53%) have been independently experimentally localised. Because experimental (GFP, MS) data were introduced in the SUBAcon classification algorithm, the assembly of ASURE sub-proteomes used additional inclusion criteria for curated ASURE proteins, such as protein function and evidence from orthologues in other species. ASURE showed a discrepancy of less than 1% compared to the high-confidence Arabidopsis plastid proteome and to the peer-reviewed reference sets used for training the predictors MultiLoc2, EpiLoc and YLoc. ASURE is a searchable high-confidence subproteome reference standard that can be accessed through the ASURE portal.